History log of /linux-master/fs/btrfs/ctree.h
Revision Date Author Comments
# 5693a128 26-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: add forward declarations and headers, part 3

Do a cleanup in the rest of the headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>


# 91701bdf 12-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: make btrfs_error_unpin_extent_range() return void

This helper is used in transaction abort or cleanup context and the
callers cannot handle all errors, only do best effort.

btrfs_cleanup_one_transaction
btrfs_destroy_delayed_refs
btrfs_error_unpin_extent_range
btrfs_destroy_pinned_extent
btrfs_error_unpin_extent_range

Signed-off-by: David Sterba <dsterba@suse.com>


# 6140ba8a 06-Dec-2023 David Sterba <dsterba@suse.com>

btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree

The radix-tree has been superseded by the xarray
(https://lwn.net/Articles/745073), this patch converts the
btrfs_root::delayed_nodes, the APIs are used in a simple way.

First idea is to do xa_insert() but this would require GFP_ATOMIC
allocation which we want to avoid if possible. The preload mechanism of
radix-tree can be emulated within the xarray API.

- xa_reserve() with GFP_NOFS outside of the lock, the reserved entry
is inserted atomically at most once

- xa_store() under a lock, in case something races in we can detect that
and xa_load() returns a valid pointer

All uses of xa_load() must check for a valid pointer in case they manage
to get between the xa_reserve() and xa_store(), this is handled in
btrfs_get_delayed_node().

Otherwise the functionality is equivalent, xarray implements the
radix-tree and there should be no performance difference.

The patch continues the efforts started in 253bf57555e451 ("btrfs: turn
delayed_nodes_tree into an XArray") and fixes the problems with locking
and GFP flags 088aea3b97e0ae ("Revert "btrfs: turn delayed_nodes_tree
into an XArray"").

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 46524fab 20-Nov-2023 David Sterba <dsterba@suse.com>

btrfs: remove unused btrfs_root::type

Looks like the struct member was added in 2007 in 2.6.29 in commit
87ee04eb0f2f ("Btrfs: Add simple stripe size parameter") but hasn't been
used at all since. So let's remove it. This was found by tool
https://github.com/jirislaby/clang-struct, then build tested after
removing the struct member.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6e5de50f 19-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: use bool for return type of btrfs_block_can_be_shared()

Currently btrfs_block_can_be_shared() returns an int that is used as a
boolean. Since it all it needs is to return true or false, and it can't
return errors for example, change the return type from int to bool to
make it a bit more readable and obvious.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6000d931 18-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: remove log_extents_lock and logged_list from struct btrfs_root

The logged_list[2] and log_extents_lock[2] members of struct btrfs_root
are no longer used, their last use was removed in commit 5636cf7d6dc8
("btrfs: remove the logged extents infrastructure"). So remove these
fields. This reduces the size of struct btrfs_root, on a release kernel,
from 1392 bytes down to 1352 bytes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6008859b 04-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: add and use helpers for reading and writing log_transid

Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.

Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f9850787 04-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: add and use helpers for reading and writing last_log_commit

Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6422b4cd 26-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: move btrfs_realloc_node() from ctree.c into defrag.c

btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
defrag.c file, so move it, and its helper close_blocks(), into defrag.c.

During the move also do a few minor cosmetic changes:

1) Change the return value of close_blocks() from int to bool;

2) Use SZ_32K instead of 32768 at close_blocks();

3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
'end_slot';

4) Get rid of 'parent_nritems' variable, in both places where it was
used it could be replaced by calling btrfs_header_nritems(parent);

5) Change the type of a couple variables from int to bool;

6) Rename variable 'err' to 'ret', as that's the most common name we
use to track the return value of a function;

7) Move some variables from the top scope to the scope of the for loop
where they are used.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 79d25df0 26-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: export comp_keys() from ctree.c as btrfs_comp_keys()

Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
later patch we can move out defrag specific code from ctree.c into
defrag.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 95f93bc4 26-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: rename and export __btrfs_cow_block()

Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is
to allow to move defrag specific code out of ctree.c and into defrag.c in
one of the next patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2672a051 28-Jun-2023 Boris Burkov <boris@bur.io>

btrfs: track data relocation with simple quota

Relocation data allocations are quite tricky for simple quotas. The
basic data relocation sequence is (ignoring details that aren't relevant
to this fix):

- create a fake relocation data fs root
- create a fake relocation inode in that root
- for each data extent:
- preallocate a data extent on behalf of the fake inode
- copy over the data
- for each extent
- swap the refs so that the original file extent now refers to the new
extent item
- drop the fake root, dropping its refs on the old extents, which lets
us delete them.

Done naively, this results in storing an extent item in the extent tree
whose owner_ref points at the relocation data root and a no-op squota
recording, since the reloc root is not a legit fstree. So far, that's
OK. The problem comes when you do the swap, and leave an extent item
owned by this bogus root as the real permanent extents of the file. If
the file then drops that ref, we free it and no-op account that against
the fake relocation root. Essentially, this means that relocation is
simple quota "extent laundering", since we re-own the extents into a
fake root.

Simple quotas very intentionally doesn't have a mechanism for
transferring ownership of extents, as that is exactly the complicated
thing we are trying to avoid with the new design. Further, it cannot be
correctly done in this case, since at the time you create the new
"real" refs, there is no way to know which was the original owner before
relocation unless we track it.

Therefore, it makes more sense to trick the preallocation to handle
relocation as a special case and note the proper owner ref from the
beginning. That way, we never write out an extent item without the
correct owner ref that it will eventually have.

This could be done by wiring a special root parameter all the way
through the allocation code path, but to avoid that special case
touching all the code, take advantage of the serial nature of relocation
to store the src root on the relocation root object. Then when we finish
the prealloc, if it happens to be this case, prepare the delayed ref
appropriately.

We must also add logic to handle relocating adjacent extents with
different owning roots. Those cannot be preallocated together in a
cluster as it would lose the separate ownership information.

This is obviously a smelly bit of code, but I think it is the best
solution to the problem, given the relocation implementation.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 50564b65 12-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: abort transaction on generation mismatch when marking eb as dirty

When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2a3a1dd9 25-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: remove extraneous includes from ctree.h

We don't need any of these includes in the ctree.h header file for the
header file itself, remove them to clean up ctree.h a little bit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1b9e6a15 25-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_name_hash to dir-item.h

This is related to the name hashing for dir items, move it into
dir-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 98e4f060 25-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_extref_hash into inode-item.h

Ideally this would be un-inlined, but that is a cleanup for later. For
now move this into inode-item.h, which is where the extref code lives.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 03e86348 25-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: remove btrfs_crc32c wrapper

This simply sends the same arguments into crc32c(), and is just used in
a few places. Remove this wrapper and directly call crc32c() in these
instances.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 102f2640 25-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_crc32c_final into free-space-cache.c

This is the only place this helper is used, take it out of ctree.h and
move it into free-space-cache.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb96e221 19-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: fix unwritten extent buffer after snapshotting a new subvolume

When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
address 36007936, for the leaf/root of a new subvolume that has an ID
of 291. We mark the extent buffer as dirty, and at this point the
subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
create_subvol(). We used to, but that was recently removed in
commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
create");

3) The transaction used to create the subvolume has an ID of 33, so the
extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
files created and its tree height changes from 0 to 1, so we end up with
a new root at level 1 and the extent buffer 36007936 is now a leaf of
that new root node, which is extent buffer 36048896.

The commit root remains as 36007936, since we are still at transaction
33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
we end up at transaction.c:create_pending_snapshot(), in the critical
section of a transaction commit.

There we COW the root of subvolume 291, which is extent buffer 36048896.
The COW operation returns extent buffer 36048896, since there's no need
to COW because the extent buffer was created in this transaction and it
was not written yet.

The we call btrfs_copy_root() against the root node 36048896. During
this operation we allocate a new extent buffer to turn into the root
node of the snapshot, copy the contents of the root node 36048896 into
this snapshot root extent buffer, set the owner to 292 (the ID of the
snapshot), etc, and then we call btrfs_inc_ref(). This will create a
delayed reference for each leaf pointed by the root node with a
reference root of 292 - this includes a reference for the leaf
36007936.

After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

Then we call btrfs_insert_dir_item(), to create the directory entry in
in the tree of subvolume 291 that points to the snapshot. This ends up
needing to modify leaf 36007936 to insert the respective directory
items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
we need to COW the leaf. We end up at btrfs_force_cow_block() and then
at update_ref_for_cow().

At update_ref_for_cow() we call btrfs_block_can_be_shared() which
returns false, despite the fact the leaf 36007936 is shared - the
subvolume's root and the snapshot's root point to that leaf. The
reason that it incorrectly returns false is because the commit root
of the subvolume is extent buffer 36007936 - it was the initial root
of the subvolume when we created it. So btrfs_block_can_be_shared()
which has the following logic:

int btrfs_block_can_be_shared(struct btrfs_root *root,
struct extent_buffer *buf)
{
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
buf != root->node && buf != root->commit_root &&
(btrfs_header_generation(buf) <=
btrfs_root_last_snapshot(&root->root_item) ||
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
return 1;

return 0;
}

Returns false (0) since 'buf' (extent buffer 36007936) matches the
root's commit root.

As a result, at update_ref_for_cow(), we don't check for the number
of references for extent buffer 36007936, we just assume it's not
shared and therefore that it has only 1 reference, so we set the local
variable 'refs' to 1.

Later on, in the final if-else statement at update_ref_for_cow():

static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf,
struct extent_buffer *cow,
int *last_ref)
{
(...)
if (refs > 1) {
(...)
} else {
(...)
btrfs_clear_buffer_dirty(trans, buf);
*last_ref = 1;
}
}

So we mark the extent buffer 36007936 as not dirty, and as a result
we don't write it to disk later in the transaction commit, despite the
fact that the snapshot's root points to it.

Attempting to access the leaf or dumping the tree for example shows
that the extent buffer was not written:

$ btrfs inspect-internal dump-tree -t 292 /dev/sdb
btrfs-progs v6.2.2
file tree key (292 ROOT_ITEM 33)
node 36110336 level 1 items 2 free space 119 generation 33 owner 292
node 36110336 flags 0x1(WRITTEN) backref revision 1
checksum stored a8103e3e
checksum calced a8103e3e
fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
key (256 INODE_ITEM 0) block 36007936 gen 33
key (257 EXTENT_DATA 0) block 36052992 gen 33
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
total bytes 107374182400
bytes used 38572032
uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

The respective on disk region is full of zeroes as the device was
trimmed at mkfs time.

Obviously 'btrfs check' also detects and complains about this:

$ btrfs check /dev/sdb
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
generation: 33 (33)
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
owner ref check failed [36007936 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
The following tree block(s) is corrupted in tree 292:
tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
root 292 root dir 256 not found
ERROR: errors found in fs roots
found 38572032 bytes used, error(s) found
total csum bytes: 16048
total tree bytes: 1265664
total fs tree bytes: 1118208
total extent tree bytes: 65536
btree space waste bytes: 562598
file data blocks allocated: 65978368
referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

$ cat test.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

# Use a filesystem with a 64K node size so that we have the same node
# size on every machine regardless of its page size (on x86_64 default
# node size is 16K due to the 4K page size, while on PPC it's 64K by
# default). This way we can make sure we are able to create a btree for
# the subvolume with a height of 2.
mkfs.btrfs -f -n 64K $DEV
mount $DEV $MNT

btrfs subvolume create $MNT/subvol

# Create a few empty files on the subvolume, this bumps its btree
# height to 2 (root node at level 1 and 2 leaves).
for ((i = 1; i <= 300; i++)); do
echo -n > $MNT/subvol/file_$i
done

btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

umount $DEV

btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

$ ./test.sh
Create subvolume '/mnt/sdi/subvol'
Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
owner ref check failed [30539776 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
Wrong generation of child node/leaf, wanted: 5, have: 7
root 257 root dir 256 not found
ERROR: errors found in fs roots
found 917504 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 851968
total fs tree bytes: 393216
total extent tree bytes: 65536
btree space waste bytes: 736550
file data blocks allocated: 0
referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9b378f6a 12-Aug-2023 Filipe Manana <fdmanana@suse.com>

btrfs: fix infinite directory reads

The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

$ cat /mnt/readdir_prog.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
DIR *dir = opendir(".");
struct dirent *dd;

while ((dd = readdir(dir))) {
printf("%s\n", dd->d_name);
rename(dd->d_name, "TEMPFILE");
rename("TEMPFILE", dd->d_name);
}
closedir(dir);
}

$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

$ cat test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null

mount $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
echo -n > $MNT/testdir/file_$i
done

cd $MNT/testdir
/mnt/readdir_prog

cd /mnt

umount $MNT

This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.

So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 751a2761 08-Jun-2023 Filipe Manana <fdmanana@suse.com>

btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()

At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record
tree mod log operations, do a transaction abort and return the error to
the callers. There's really no need for the BUG_ON() as we can release all
resources in the context of all callers, and we have to abort because other
future tree searches that use the tree mod log (btrfs_search_old_slot())
may get inconsistent results if other operations modify the tree after
that failure and before the tree mod log based search.

This implies btrfs_del_ptr() return an int instead of void, and making all
callers check for returned errors.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 016f9d0b 29-Apr-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: rename del_ptr to btrfs_del_ptr and export it

This exists internal to ctree.c, however btrfs check needs to use it for
some of its operations. I'd rather not duplicate that code inside of
btrfs check as this is low level and I want to keep this code in one
place, so rename the function to btrfs_del_ptr and export it so that it
can be used inside of btrfs-progs safely. Add a comment to make sure
this doesn't get removed by a future cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b3cbfb0d 29-Apr-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: add a btrfs_csum_type_size helper

This is needed in btrfs-progs for the tools that convert the checksum
types for file systems and a few other things. We don't have it in the
kernel as we just want to get the size for the super blocks type.
However I don't want to have to manually add this every time we sync
ctree.c into btrfs-progs, so add the helper in the kernel with a note so
it doesn't get removed by a later cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6c75a589 27-Apr-2023 Qu Wenruo <wqu@suse.com>

btrfs: print-tree: pass const extent buffer pointer

Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f469c8bd 12-Apr-2023 Filipe Manana <fdmanana@suse.com>

btrfs: unexport btrfs_prev_leaf()

btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fdf8d595 23-Feb-2023 Anand Jain <anand.jain@oracle.com>

btrfs: open code btrfs_bin_search()

btrfs_bin_search() is a simple wrapper that searches for the whole slots
by calling btrfs_generic_bin_search() with the starting slot/first_slot
preset to 0.

This simple wrapper can be open coded as btrfs_bin_search().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a724f313 08-Feb-2023 Filipe Manana <fdmanana@suse.com>

btrfs: do unsigned integer division in the extent buffer binary search loop

In the search loop of the binary search function, we are doing a division
by 2 of the sum of the high and low slots. Because the slots are integers,
the generated assembly code for it is the following on x86_64:

0x00000000000141f1 <+145>: mov %eax,%ebx
0x00000000000141f3 <+147>: shr $0x1f,%ebx
0x00000000000141f6 <+150>: add %eax,%ebx
0x00000000000141f8 <+152>: sar %ebx

It's a few more instructions than a simple right shift, because signed
integer division needs to round towards zero. However we know that slots
can never be negative (btrfs_header_nritems() returns an u32), so we
can instead use unsigned types for the low and high slots and therefore
use unsigned integer division, which results in a single instruction on
x86_64:

0x00000000000141f0 <+144>: shr %ebx

So use unsigned types for the slots and therefore unsigned division.

This is part of a small patchset comprised of the following two patches:

btrfs: eliminate extra call when doing binary search on extent buffer
btrfs: do unsigned integer division in the extent buffer binary search loop

The following fs_mark test was run on a non-debug kernel (Debian's default
kernel config) before and after applying the patchset:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

OPTS="-S 0 -L 6 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done

fs_mark $OPTS

umount $MNT

Results before applying patchset:

FSUse% Count Size Files/sec App Overhead
2 1200000 0 174472.0 11549868
4 2400000 0 253503.0 11694618
4 3600000 0 257833.1 11611508
6 4800000 0 247089.5 11665983
6 6000000 0 211296.1 12121244
10 7200000 0 187330.6 12548565

Results after applying patchset:

FSUse% Count Size Files/sec App Overhead
2 1200000 0 207556.0 11393252
4 2400000 0 266751.1 11347909
4 3600000 0 274397.5 11270058
6 4800000 0 259608.4 11442250
6 6000000 0 238895.8 11635921
8 7200000 0 211942.2 11873825

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7b00dfff 08-Feb-2023 Filipe Manana <fdmanana@suse.com>

btrfs: eliminate extra call when doing binary search on extent buffer

The function btrfs_bin_search() is just a wrapper around the function
generic_bin_search(), which passes the same arguments plus a default
low slot with a value of 0. This adds an unnecessary extra function
call, since btrfs_bin_search() is not static. So improve on this by
making btrfs_bin_search() an inline function that calls
generic_bin_search(), renaming the later to btrfs_generic_bin_search()
and exporting it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0e6c40eb 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the csum helpers into ctree.h

These got moved because of copy+paste, but this code exists in ctree.c,
so move the declarations back into ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9b48adda 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move eb offset helpers into extent_io.h

These are very specific to how the extent buffer is defined, so this
differs between btrfs-progs and the kernel. Make things easier by
moving these helpers into extent_io.h so we don't have to worry about
this when syncing ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6bfd0ffa 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move file_extent_item helpers into file-item.h

These helpers use functions that are in multiple places, which makes it
tricky to sync them into btrfs-progs. Move them to file-item.h and then
include file-item.h in places that use these helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1fe5ebc4 15-Nov-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move root helpers back into ctree.h

These accidentally got brought into accessors.h, but belong with the
btrfs_root definitions which are currently in ctree.h. Move these to
make it easier to sync accessors.[ch] into btrfs-progs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3c32c721 11-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use cached state when looking for delalloc ranges with lseek

During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent,
we will look for delalloc in that range, and one of the things we do for
that is to find out ranges in the inode's io_tree marked with
EXTENT_DELALLOC, using calls to count_range_bits().

Typically there's a single, or few, searches in the io_tree for delalloc
per lseek call. However it's common for applications to keep calling
lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in
a file, read the extents and skip holes in order to avoid unnecessary IO
and save disk space by preserving holes.

One popular user is the cp utility from coreutils. Starting with coreutils
9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a
file. Before 9.0, it used fiemap to figure out where holes and extents are
in the source file. Another popular user is the tar utility when used with
the --sparse / -S option to detect and preserve holes.

Given that the pattern is to keep calling lseek with a start offset that
matches the returned offset from the previous lseek call, we can benefit
from caching the last extent state visited in count_range_bits() and use
it for the next count_range_bits() from the next lseek call. Example,
the following strace excerpt from running tar:

$ strace tar cJSvf foo.tar.xz qemu_disk_file.raw
(...)
lseek(5, 125019574272, SEEK_HOLE) = 125024989184
lseek(5, 125024989184, SEEK_DATA) = 125024993280
lseek(5, 125024993280, SEEK_HOLE) = 125025239040
lseek(5, 125025239040, SEEK_DATA) = 125025255424
lseek(5, 125025255424, SEEK_HOLE) = 125025353728
lseek(5, 125025353728, SEEK_DATA) = 125025357824
lseek(5, 125025357824, SEEK_HOLE) = 125026766848
lseek(5, 125026766848, SEEK_DATA) = 125026770944
lseek(5, 125026770944, SEEK_HOLE) = 125027053568
(...)

Shows that pattern, which is the same as with cp from coreutils 9.0+.

So start using a cached state for the delalloc searches in lseek, and
store it in struct file's private data so that it can be reused across
lseek calls.

This change is part of a patchset that is comprised of the following
patches:

1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek

The following test was run before and after applying the whole patchset:

$ cat test-cp.sh
#!/bin/bash

DEV=/dev/sdh
MNT=/mnt/sdh

# coreutils 8.32, cp uses fiemap to detect holes and extents
#CP_PROG=/usr/bin/cp
# coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents
CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp

umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT

FILE_SIZE=$((1024 * 1024 * 1024))
echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M"
# Create a very sparse file, where each extent has a length of 4K and
# is preceded by a 4K hole and followed by another 4K hole.
start=$(date +%s%N)
echo -n > $MNT/foobar
for ((off = 0; off < $FILE_SIZE; off += 8192)); do
xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
echo -ne "\r$off / $FILE_SIZE ..."
done
end=$(date +%s%N)
echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)"

start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds with data/metadata cached and delalloc"

# Flush all delalloc.
sync

start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds with data/metadata cached and no delalloc"

# Unmount and mount again to test the case without any metadata
# loaded in memory.
umount $MNT
mount $DEV $MNT

start=$(date +%s%N)
$CP_PROG $MNT/foobar /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "cp took $dur milliseconds without data/metadata cached and no delalloc"

umount $MNT

The results, running on a box with a non-debug kernel (Debian's default
kernel config), were the following:

128M file, before patchset:

cp took 16574 milliseconds with data/metadata cached and delalloc
cp took 122 milliseconds with data/metadata cached and no delalloc
cp took 20144 milliseconds without data/metadata cached and no delalloc

128M file, after patchset:

cp took 6277 milliseconds with data/metadata cached and delalloc
cp took 109 milliseconds with data/metadata cached and no delalloc
cp took 210 milliseconds without data/metadata cached and no delalloc

512M file, before patchset:

cp took 14369 milliseconds with data/metadata cached and delalloc
cp took 429 milliseconds with data/metadata cached and no delalloc
cp took 88034 milliseconds without data/metadata cached and no delalloc

512M file, after patchset:

cp took 12106 milliseconds with data/metadata cached and delalloc
cp took 427 milliseconds with data/metadata cached and no delalloc
cp took 824 milliseconds without data/metadata cached and no delalloc

1G file, before patchset:

cp took 10074 milliseconds with data/metadata cached and delalloc
cp took 886 milliseconds with data/metadata cached and no delalloc
cp took 181261 milliseconds without data/metadata cached and no delalloc

1G file, after patchset:

cp took 3320 milliseconds with data/metadata cached and delalloc
cp took 880 milliseconds with data/metadata cached and no delalloc
cp took 1801 milliseconds without data/metadata cached and no delalloc

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# aa5d3003 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move orphan prototypes into orphan.h

Move these out of ctree.h into orphan.h to cut down on code in ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c03b2207 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move super prototypes into super.h

Move these out of ctree.h into super.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6a6b4daf 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move CONFIG_BTRFS_FS_RUN_SANITY_TESTS checks to fs.h

We already have a few of these in fs.h, move the remaining checks out of
ctree.h into fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c11adcc 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move verity prototypes into verity.h

Move these out of ctree.h into verity.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 77407dc0 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move dev-replace prototypes into dev-replace.h

We already have a dev-replace.h, simply move these prototypes and
helpers into dev-replace.h where they belong.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2fc6822c 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move scrub prototypes into scrub.h

Move these out of ctree.h into scrub.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 67707479 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move relocation prototypes into relocation.h

Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33cf97a7 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move acl prototypes into acl.h

Move these out of ctree.h into acl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cc68414c 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the snapshot drop related prototypes to extent-tree.h

These belong in extent-tree.h, they were missed because they were not
grouped with the other extent-tree.c prototypes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b538a271 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the 32bit warn defines into messages.h

The code for these functions are in messages.c, move the defines and
prototypes to messages.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af142b6f 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move file prototypes to file.h

Move these out of ctree.h into file.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7572dec8 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move ioctl prototypes into ioctl.h

Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c7a03b52 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move uuid tree prototypes to uuid-tree.h

Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7c8ede16 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move file-item prototypes into their own header

Move these prototypes out of ctree.h and into file-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f2b39277 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move dir-item prototypes into dir-item.h

Move these prototypes out of ctree.h and into their own header file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 59b818e0 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move defrag related prototypes to their own header

Now that the defrag code is all in one file, create a defrag.h and move
all the defrag related prototypes and helper out of ctree.h and into
defrag.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2885fd63 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move inode prototypes to btrfs_inode.h

I initially wanted to make a new header file for this, but these
prototypes do naturally fit into btrfs_inode.h. If we want to extract
vfs from pure btrfs code in the future we may need to split this up, but
btrfs_inode embeds the vfs_inode, so it makes sense to put the
prototypes in this header for now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b31bed17 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_chunk_item_size out of ctree.h

This is used by the volumes code and the tree checker code. We want to
maintain inline however, so simply move it to volumes.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 911bd75a 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove unused function prototypes

I wrote the following coccinelle script to find function declarations
that didn't have the corresponding code for them

@funcproto@
identifier func;
type T;
position p0;
@@

T func@p0(...);

@funccode@
identifier funcproto.func;
position p1;
@@

func@p1(...) { ... }

@script:python depends on !funccode@
p0 << funcproto.p0;
@@
print("Proto with no function at %s:%s" % (p0[0].file, p0[0].line))

and ran it against btrfs, which identified the 4 function prototypes
I've removed in this patch.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 45c40c8f 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move root tree prototypes to their own header

Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6d2049a2 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: delete unused function prototypes in ctree.h

This batch of prototypes no longer have code associated with them, so
remove them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2839c2c1 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move delalloc space related prototypes to delalloc-space.h

These exist in delalloc-space.c, move them from ctree.h into
delalloc-space.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a0231804 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent-tree helpers into their own header file

Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e2f13b34 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_account_ro_block_groups_free_space into space-info.c

This was prototyped in ctree.h and the code existed in extent-tree.c,
but it's space-info related so move it into space-info.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8483d402 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove extra space info prototypes in ctree.h

These are defined already in space-info.h, remove them from ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 13d925c1 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: minor whitespace in ctree.h

We've accumulated some whitespace problems in ctree.h, clean these up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb33a4d6 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the lockdep helpers into locking.h

These more naturally fit in with the locking related code, and they're
all defines so they can easily go anywhere, move them out of ctree.h
into locking.h

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a56159d4 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_fs_info declarations into fs.h

Now that we have a lot of the fs_info related helpers and stuff
isolated, copy these over to fs.h out of ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 6db75318 19-Oct-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: use struct fscrypt_str instead of struct qstr

While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab3c5c18 19-Oct-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: setup qstr from dentrys using fscrypt helper

Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.

At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e43eec81 19-Oct-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: use struct qstr instead of name and namelen pairs

Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.

Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.

This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e9c83077 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove temporary btrfs_map_token declaration in ctree.h

This was added while I was moving this code to its new home, it can be
removed now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 07e81dc9 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move accessor helpers into accessors.h

This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ad1ac501 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_map_token to accessors

This is specific to the item-accessor code, move it out of ctree.h into
accessor.h/.c and then update the users to include the new header file.
This un-inlines btrfs_init_map_token, however this is only called once
per function so it's not critical to be inlined. This also saves 904
bytes of code on a release build.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d83eb482 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the compat/incompat flag masks to fs.h

This is fs wide information, move it out of ctree.h into fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55e5cfd3 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove fs_info::pending_changes and related code

Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.

We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.

[1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 7966a6b5 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move fs_info::flags enum to fs.h

These definitions are fs wide, take them out of ctree.h and put them in
fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc97a410 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move mount option definitions to fs.h

These are fs wide definitions and helpers, move them out of ctree.h and
into fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ec8eb376 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h

We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9b569ea0 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the printk helpers out of ctree.h

We have a bunch of printk helpers that are in ctree.h. These have
nothing to do with ctree.c, so move them into their own header.
Subsequent patches will cleanup the printk helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e118578a 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move assert helpers out of ctree.h

These call functions that aren't defined in, or will be moved out of,
ctree.h Move them to super.c where the other assert/error message code
is defined. Drop the __noreturn attribute for btrfs_assertfail as
objtool does not like it and fails with warnings like

fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section
fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold()
fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c7f13d42 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move fs wide helpers out of ctree.h

We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 63a7cb13 26-Jul-2022 David Sterba <dsterba@suse.com>

btrfs: auto enable discard=async when possible

There's a request to automatically enable async discard for capable
devices. We can do that, the async mode is designed to wait for larger
freed extents and is not intrusive, with limits to iops, kbps or latency.

The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .

The automatic selection is done if there's at least one discard capable
device in the filesystem (not capable devices are skipped). Mounting
with any other discard option will honor that option, notably mounting
with nodiscard will keep it disabled.

Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# c7321b76 09-Sep-2022 David Sterba <dsterba@suse.com>

btrfs: convert BTRFS_ILOCK-* defines to enum bit

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7a66eda3 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h

These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
without verity enabled. Move these definitions up with the other
accessor helpers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 890d2b1a 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_next_old_item into ctree.c

This uses btrfs_header_nritems, which I will be moving out of ctree.h.
In order to avoid needing to include the relevant header in ctree.h,
simply move this helper function into ctree.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename parameters ]
Signed-off-by: David Sterba <dsterba@suse.com>


# eda517fd 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move free space cachep's out of ctree.h

This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 226463d7 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_path_cachep out of ctree.h

This is local to the ctree code, remove it from ctree.h and inode.c,
create new init/exit functions for the cachep, and move it locally to
ctree.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 956504a3 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move trans_handle_cachep out of ctree.h

This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f1e5c618 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move flush related definitions to space-info.h

This code is used in space-info.c, move the definitions to space-info.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 390d89cc 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move discard stat defs to free-space-cache.h

These definitions are used for discard statistics, move them out of
ctree.h and put them in free-space-cache.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed4c491a 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move BTRFS_MAX_MIRRORS into scrub.c

This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ad4b63ca 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move maximum limits to btrfs_tree.h

We have maximum link and name length limits, move these to btrfs_tree.h
as they're on disk limitations.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4300c58f 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs on-disk definitions out of ctree.h

The bulk of our on-disk definitions exist in btrfs_tree.h, which user
space can use. Keep things consistent and move the rest of the on disk
definitions out of ctree.h into btrfs_tree.h. Note I did have to update
all u8's to __u8, but otherwise this is a strict copy and paste.

Most of the definitions are mainly for internal use and are not
guaranteed stable public API and may change as we need. Compilation
failures by user applications can happen.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ce76e8e 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove unused BTRFS_IOPRIO_READA

The last user of this definition was removed in patch f26c92386028
("btrfs: remove reada infrastructure") so we can remove this definition.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ea206640 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove unused BTRFS_TOTAL_BYTES_PINNED_BATCH

This hasn't been used since 138a12d86574 ("btrfs: rip out
btrfs_space_info::total_bytes_pinned") so it is safe to remove.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d60d956e 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove unused set/clear_pending_info helpers

The last users of these helpers were removed in 5297199a8bca ("btrfs:
remove inode number cache feature") so delete these helpers.

The point was for mount options that were applicable after transaction
commit so they could not be applied immediately. We don't have such
options anymore and if we do the patch can be reverted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 138060ba 23-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: pass dentry to set acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystem rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
operation. But since ->set_acl() is required in order to use the generic
posix acl xattr handlers filesystems that do not implement this inode
operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 8bb808c6 03-Nov-2022 David Sterba <dsterba@suse.com>

btrfs: don't print stack trace when transaction is aborted due to ENOMEM

Add ENOMEM among the error codes that don't print stack trace on
transaction abort. We've got several reports from syzbot that detects
stacks as errors but caused by limiting memory. As this is an artificial
condition we don't need to know where exactly the error happens, the
abort and error cleanup will continue like e.g. for EIO.

As the transaction aborts code needs to be inline in a lot of code, the
implementation cases about minimal bloat. The error codes are in a
separate function and the WARN uses the condition directly. This
increases the code size by 571 bytes on release build.

Alternatives considered: add -ENOMEM among the errors, this increases
size by 2340 bytes, various attempts to combine the WARN and helper
calls, increase by 700 or more bytes.

Example syzbot reports (error -12):

- https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e
- https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417

Signed-off-by: David Sterba <dsterba@suse.com>


# 8184620a 28-Oct-2022 Filipe Manana <fdmanana@suse.com>

btrfs: fix lost file sync on direct IO write with nowait and dsync iocb

When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.

This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().

For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).

So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.

A test case will later be sent for fstests, after this is fixed in Linus'
tree.

Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c0c8cfc 19-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: move btrfs_drop_extent_cache() to extent_map.c

The function btrfs_drop_extent_cache() doesn't really belong at file.c
because what it does is drop a range of extent maps for a file range.
It directly allocates and manipulates extent maps, by dropping,
splitting and replacing them in an extent map tree, so it should be
located at extent_map.c, where all manipulations of an extent map tree
and its extent maps are supposed to be done.

So move it out of file.c and into extent_map.c. Additionally do the
following changes:

1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
clear about what it does. The term "cache" is a bit confusing as it's
not widely used, "extent maps" or "extent mapping" is much more common;

2) Change its 'skip_pinned' argument from int to bool;

3) Turn several of its local variables from int to bool, since they are
used as booleans;

4) Move the declaration of some variables out of the function's main
scope and into the scopes where they are used;

5) Remove pointless assignment of false to 'modified' early in the while
loop, as later that variable is set and it's not used before that
second assignment;

6) Remove checks for NULL before calling free_extent_map().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3050dfa6 22-Sep-2022 Jeff Layton <jlayton@kernel.org>

btrfs: remove stale prototype of btrfs_write_inode

This function no longer exists, was removed in 3c4276936f6f ("Btrfs: fix
btrfs_write_inode vs delayed iput deadlock").

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 80f9d241 12-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: make btrfs_check_nocow_lock nowait compatible

Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add
a nowait flag to btrfs_check_nocow_lock so it can be used by the write
path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 26ce9114 12-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: make can_nocow_extent nowait compatible

If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 857bc13f 12-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: implement a nowait option for tree searches

For NOWAIT IOCBs we'll need a way to tell search to not wait on locks
or anything. Accomplish this by adding a path->nowait flag that will
use trylocks and skip reading of metadata, returning -EAGAIN in either
of these cases. For now we only need this for reads, so only the read
side is handled. Add an ASSERT() to catch anybody trying to use this
for writes so they know they'll have to implement the write side.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d9d88fde 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move the fs_info related helpers closer to fs_info in ctree.h

This is purely cosmetic, to make it straightforward to copy and paste
the definition and helpers from ctree.h into fs.h. These are helpers
that act directly on the fs_info, and were scattered throughout ctree.h.
Move them directly below the fs_info definition to make it easier to
move them later. This includes the exclop prototypes, which shares an
enum that's used in struct btrfs_fs_info as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f119553f 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_csum_ptr to inode.c

This helper is only used in inode.c, move it locally to that file
instead of defining it in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0e75f005 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move fs_info forward declarations to the top of ctree.h

In order to make it more straightforward to move the fs_info struct and
it's related structures, move the struct declarations to the top of
ctree.h. This will make it easier to clean up after the fact.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2103da3b 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_swapfile_pin into volumes.h

This isn't a great spot for this, but one of the swapfile helper
functions is in volumes.c, so move the struct to volumes.h. In the
future when we have better separation of code there will be a more
natural spot for this.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2e79e86 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_pinned_by_swapfile prototype into volumes.h

This is defined in volumes.c, move the prototype into volumes.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43712116 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_init_async_reclaim_work prototype to space-info.h

The code for this helper is in space-info.c, move the prototype to
space-info.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c29abab4 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_full_stripe_locks_tree into block-group.h

This is actually embedded in struct btrfs_block_group, so move this
definition to block-group.h, and then open-code the init of the tree
where we init the rest of the block group instead of using a helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 16708a88 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_caching_type to block-group.h

This is a block group related definition, move it into block-group.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6ea1a526 09-Sep-2022 Gaosheng Cui <cuigaosheng1@huawei.com>

btrfs: remove btrfs_bit_radix_cachep declaration

btrfs_bit_radix_cachep has been removed since
commit 45c06543afe2 ("Btrfs: remove unused btrfs_bit_radix slab"),
so remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 011b46c3 23-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()

Btrfs qgroup has a long history of bringing performance penalty in
btrfs_commit_transaction().

Although we tried our best to migrate such impact, there is still an
unsolved call site, btrfs_drop_snapshot().

This function will find the highest shared tree block and modify its
extent ownership to do a subvolume/snapshot dropping.

Such change will affect the whole subtree, and cause tons of qgroup
dirty extents and stall btrfs_commit_transaction().

To avoid such problem, here we introduce a new sysfs interface,
/sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine at
whether and at which level we should skip qgroup accounting for subtree
dropping.

The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go
through qgroup accounting, to ensure qgroup numbers are kept as
consistent as possible.

While for performance sensitive cases, add a way to change the values to
more reasonable values like 3, to make any subtree, which is at or higher
than level 3, to mark qgroup inconsistent and skip the accounting.

The cost is obvious, the qgroup number is no longer consistent, but at
least performance is more reasonable, and users have the control.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ac3c0d36 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: make fiemap more efficient and accurate reporting extent sharedness

The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.

To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get
the first extent map. This leads us to btrfs_get_extent_fiemap(), which
in turn calls btrfs_get_extent(), to find the first extent map that
covers the file range [0, LLONG_MAX).

btrfs_get_extent() will first search the inode's extent map tree, to
see if we have an extent map there that covers the range. If it does
not find one, then it will search the inode's subvolume b+tree for a
fitting file extent item. After finding the file extent item, it will
allocate an extent map, fill it in with information extracted from the
file extent item, and add it to the inode's extent map tree (which
requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of
the extent, and call again get_extent_skip_holes(), with a start
offset matching the end of the extent map we previously processed.

We end up at btrfs_get_extent() again, will search the extent map tree
and then search the subvolume b+tree for a file extent item if we could
not find an extent map in the extent tree. We allocate an extent map,
fill it in with the details in the file extent item, and then insert
it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the
whole file range. Each iteration ends at btrfs_get_extent(), which
does a red black tree search on the extent map tree, then searches the
subvolume b+tree, allocates an extent map and then does another search
in the extent map tree in order to insert the extent map.

In the best scenario we have all the extent maps already in the extent
tree, and so for each extent we do a single search on a red black tree,
so we have a complexity of O(n log n).

In the worst scenario we don't have any extent map already loaded in
the extent map tree, or have very few already there. In this case the
complexity is much higher since we do:

- A red black tree search on the extent map tree, which has O(log n)
complexity, initially very fast since the tree is empty or very
small, but as we end up allocating extent maps and adding them to
the tree when we don't find them there, each subsequent search on
the tree gets slower, since it's getting bigger and bigger after
each iteration.

- A search on the subvolume b+tree, also O(log n) complexity, but it
has items for all inodes in the subvolume, not just items for our
inode. Plus on a filesystem with concurrent operations on other
inodes, we can block doing the search due to lock contention on
b+tree nodes/leaves.

- Allocate an extent map - this can block, and can also fail if we
are under serious memory pressure.

- Do another search on the extent maps red black tree, with the goal
of inserting the extent map we just allocated. Again, after every
iteration this tree is getting bigger by 1 element, so after many
iterations the searches are slower and slower.

- We will not need the allocated extent map anymore, so it's pointless
to add it to the extent map tree. It's just wasting time and memory.

In short we end up searching the extent map tree multiple times, on a
tree that is growing bigger and bigger after each iteration. And
besides that we visit the same leaf of the subvolume b+tree many times,
since a leaf with the default size of 16K can easily have more than 200
file extent items.

This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.

Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.

This reproducer triggers that bug:

$ cat fiemap-bug.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f $DEV
mount $DEV $MNT

# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo

# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo

umount $MNT

Running the reproducer:

$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following
patches:

btrfs: allow hole and data seeking to be interruptible
btrfs: make hole and data seeking a lot more efficient
btrfs: remove check for impossible block start for an extent map at fiemap
btrfs: remove zero length check when entering fiemap
btrfs: properly flush delalloc when entering fiemap
btrfs: allow fiemap to be interruptible
btrfs: rename btrfs_check_shared() to a more descriptive name
btrfs: speedup checking for extent sharedness during fiemap
btrfs: skip unnecessary extent buffer sharedness checks during fiemap
btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

Before patchset:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)

After patchset:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)

That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:

$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>

#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>

#include <linux/fs.h>
#include <linux/fiemap.h>

#define FILE_INTERVAL (1<<13) /* 8Kb */

long long interval(struct timeval t1, struct timeval t2)
{
long long val = 0;
val += (t2.tv_usec - t1.tv_usec);
val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
return val;
}

int main(int argc, char **argv)
{
struct fiemap fiemap = {};
struct timeval t1, t2;
char data = 'a';
struct stat st;
int fd, off, file_size = FILE_INTERVAL;

if (argc != 3 && argc != 2) {
printf("usage: %s <path> [size]\n", argv[0]);
return 1;
}

if (argc == 3)
file_size = atoi(argv[2]);
if (file_size < FILE_INTERVAL)
file_size = FILE_INTERVAL;
file_size -= file_size % FILE_INTERVAL;

fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}

for (off = 0; off < file_size; off += FILE_INTERVAL) {
if (pwrite(fd, &data, 1, off) != 1) {
perror("pwrite");
close(fd);
return 1;
}
}

if (ftruncate(fd, file_size)) {
perror("ftruncate");
close(fd);
return 1;
}

if (fstat(fd, &st) < 0) {
perror("fstat");
close(fd);
return 1;
}

printf("size: %ld\n", st.st_size);
printf("actual size: %ld\n", st.st_blocks * 512);

fiemap.fm_length = FIEMAP_MAX_OFFSET;
gettimeofday(&t1, NULL);
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
perror("fiemap");
close(fd);
return 1;
}
gettimeofday(&t2, NULL);

printf("fiemap: fm_mapped_extents = %d\n",
fiemap.fm_mapped_extents);
printf("time = %lld us\n", interval(t1, t2));

close(fd);
return 0;
}

$ gcc -o pavels_test pavels_test.c

And the wrapper shell script:

$ cat fiemap-pavels-test.sh

#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT

echo
echo "*********** 256M ***********"
echo

./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))

echo
echo "*********** 512M ***********"
echo

./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))

echo
echo "*********** 1G ***********"
echo

./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))

umount $MNT

Running his reproducer before applying the patchset:

*********** 256M ***********

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4003133 us

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4895330 us

*********** 512M ***********

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 30123675 us

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 33450934 us

*********** 1G ***********

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 224924074 us

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 217239242 us

Running it after applying the patchset:

*********** 256M ***********

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29475 us

size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29307 us

*********** 512M ***********

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 58996 us

size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 59115 us

*********** 1G ***********

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 116251
time = 124141 us

size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 119387 us

The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).

For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.

For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.

For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.

Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 12a824dc 01-Sep-2022 Filipe Manana <fdmanana@suse.com>

btrfs: speedup checking for extent sharedness during fiemap

One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.

For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

$ cat fiemap-perf-test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

Before this patch:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)

After this patch:

$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.

Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1c56ab99 08-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2

The problem of long mount time caused by block group item search is
already known for some time, and the solution of block group tree has
been proposed.

There is really no need to bound this feature into extent tree v2, just
introduce compat RO flag, BLOCK_GROUP_TREE, to correctly solve the
problem.

All the code handling block group root is already in the upstream
kernel, thus this patch really only needs to introduce the new compat RO
flag.

This patch introduces one extra artificial limitation on block group
tree feature, that free space cache v2 and no-holes feature must be
enabled to use this new compat RO feature.

This artificial requirement is mostly to reduce the test combinations,
and can be a guideline for future features, to mostly rely on the latest
default features.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 14033b08 08-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: don't save block group root into super block

The extent tree v2 needs a new root for storing all block group items,
the whole feature hasn't been finished yet so we can afford to do some
changes.

My initial proposal years ago just added a new tree rootid, and load it
from tree root, just like what we did for quota/free space tree/uuid/extent
roots.

But the extent tree v2 patches introduced a completely new way to store
block group tree root into super block which is arguably wasteful.

Currently there are only 3 trees stored in super blocks, and they all
have their valid reasons:

- Chunk root
Needed for bootstrap.

- Tree root
Really the entry point for all trees.

- Log root
This is special as log root has to be updated out of existing
transaction mechanism.

There is not even any reason to put block group root into super blocks,
the block group tree is updated at the same time as the old extent tree,
no need for extra bootstrap/out-of-transaction update.

So just move block group root from super block into tree root.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e327b9c 25-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: dump all space infos if we abort transaction due to ENOSPC

We have hit some transaction abort due to -ENOSPC internally.

Normally we should always reserve enough space for metadata for every
transaction, thus hitting -ENOSPC should really indicate some cases we
didn't expect.

But unfortunately current error reporting will only give a kernel
warning and stack trace, not really helpful to debug what's causing the
problem.

And mount option debug_enospc can only help when user can reproduce the
problem, but under most cases, such transaction abort by -ENOSPC is
really hard to reproduce.

So this patch will dump all space infos (data, metadata, system) when we
abort the first transaction with -ENOSPC.

This should at least provide some clue to us.

The example of a dump would look like this:

BTRFS: Transaction aborted (error -28)
WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
<call trace skipped>
---[ end trace 0000000000000000 ]---
BTRFS info (device dm-1: state A): dumping space info:
BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
BTRFS info (device dm-1: state EA): forced readonly

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2bbc72f1 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: don't take a bio_counter reference for cloned bios

Stop grabbing an extra bio_counter reference for each clone bio in a
mirrored write and instead just release the one original reference in
btrfs_end_bioc once all the bios for a single btrfs_bio have completed
instead of at the end of btrfs_submit_bio once all bios have been
submitted.

This means the reference is now carried by the "upper" btrfs_bio only
instead of each lower bio.

Also remove the now unused btrfs_bio_counter_inc_noblocked helper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb731430 25-Jul-2022 David Sterba <dsterba@suse.com>

btrfs: sysfs: show discard stats and tunables in non-debug build

When discard=async was introduced there were also sysfs knobs and stats
for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults
have been set and so far seem to satisfy all users on a range of
workloads. As there are not only tunables (like iops or kbps) but also
stats tracking amount of discardable bytes, that should be available
when the async discard is on (otherwise it's not).

The stats are moved from the per-fs debug directory, so it's under
/sys/fs/btrfs/FSID/discard

- discard_bitmap_bytes - amount of discarded bytes from data tracked as
bitmaps
- discard_extent_bytes - dtto but as extents
- discard_bytes_saved -
- discardable_bytes - amount of bytes that can be discarded
- discardable_extents - number of extents to be discarded
- iops_limit - tunable limit of number of discard IOs to be issued
- kbps_limit - tunable limit of kilobytes per second issued as discard IO
- max_discard_size - tunable limit for size of one IO discard request

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38622010 15-Aug-2022 Boris Burkov <boris@bur.io>

btrfs: send: add support for fs-verity

Preserve the fs-verity status of a btrfs file across send/recv.

There is no facility for installing the Merkle tree contents directly on
the receiving filesystem, so we package up the parameters used to enable
verity found in the verity descriptor. This gives the receive side
enough information to properly enable verity again. Note that this means
that receive will have to re-compute the whole Merkle tree, similar to
how compression worked before encoded_write.

Since the file becomes read-only after verity is enabled, it is
important that verity is added to the send stream after any file writes.
Therefore, when we process a verity item, merely note that it happened,
then actually create the command in the send stream during
'finish_inode_if_needed'.

This also creates V3 of the send stream format, without any format
changes besides adding the new commands and attributes.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# d1f68ba0 23-Jul-2022 Omar Sandoval <osandov@osandov.com>

btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()

btrfs_insert_file_extent() is only ever used to insert holes, so rename
it and remove the redundant parameters.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5f4403e1 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add lockdep annotations for the ordered extents wait event

This wait event is very similar to the pending ordered wait event in the
sense that it occurs in a different context than the condition signaling
for the event. The signaling occurs in btrfs_remove_ordered_extent()
while the wait event is implemented in btrfs_start_ordered_extent() in
fs/btrfs/ordered-data.c

However, in this case a thread must not acquire the lockdep map for the
ordered extents wait event when the ordered extent is related to a free
space inode. That is because lockdep creates dependencies between locks
acquired both in execution paths related to normal inodes and paths
related to free space inodes, thus leading to false positives.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8b53779e 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add lockdep annotations for pending_ordered wait event

In contrast to the num_writers and num_extwriters wait events, the
condition for the pending ordered wait event is signaled in a different
context from the wait event itself. The condition signaling occurs in
btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
event is implemented in btrfs_commit_transaction() in
fs/btrfs/transaction.c

Thus the thread signaling the condition has to acquire the lockdep map
as a reader at the start of btrfs_remove_ordered_extent() and release it
after it has signaled the condition. In this case some dependencies
might be left out due to the placement of the annotation, but it is
better than no annotation at all.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3e738c53 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add lockdep annotations for transaction states wait events

Add lockdep annotations for the transaction states that have wait
events;

1) TRANS_STATE_COMMIT_START
2) TRANS_STATE_UNBLOCKED
3) TRANS_STATE_SUPER_COMMITTED
4) TRANS_STATE_COMPLETED

The new macros introduced here to annotate the transaction states wait
events have the same effect as the generic lockdep annotation macros.

With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START
the transaction thread has to acquire the lockdep maps for the
transaction states as reader after the lockdep map for num_writers is
released so that lockdep does not complain.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a9ba670 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add lockdep annotations for num_extwriters wait event

Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
lockdep annotation for the num_extwriters wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1489b4f 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add lockdep annotations for num_writers wait event

Annotate the num_writers wait event in fs/btrfs/transaction.c with
lockdep in order to catch deadlocks involving this wait event.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab9a323f 25-Jul-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: add macros for annotating wait events with lockdep

Introduce four macros that are used to annotate wait events in btrfs code
with lockdep;

1) the btrfs_lockdep_init_map
2) the btrfs_lockdep_acquire,
3) the btrfs_lockdep_release
4) the btrfs_might_wait_for_event macros.

The btrfs_lockdep_init_map macro is used to initialize a lockdep map.

The btrfs_lockdep_<acquire,release> macros are used by threads to take
the lockdep map as readers (shared lock) and release it, respectively.

The btrfs_might_wait_for_event macro is used by threads to take the
lockdep map as writers (exclusive lock) and release it.

In general, the lockdep annotation for wait events work as follows:

The condition for a wait event can be modified and signaled at the same
time by multiple threads. These threads hold the lockdep map as readers
when they enter a context in which blocking would prevent signaling the
condition. Frequently, this occurs when a thread violates a condition
(lockdep map acquire), before restoring it and signaling it at a later
point (lockdep map release).

The threads that block on the wait event take the lockdep map as writers
(exclusive lock). These threads have to block until all the threads that
hold the lockdep map as readers signal the condition for the wait event
and release the lockdep map.

The lockdep annotation is used to warn about potential deadlock scenarios
that involve the threads that modify and signal the wait event condition
and threads that block on the wait event. A simple example is illustrated
below:

Without lockdep:

TA TB
cond = false
lock(A)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)

With lockdep:

TA TB
rwsem_acquire_read(lockdep_map)
cond = false
lock(A)
rwsem_acquire(lockdep_map)
rwsem_release(lockdep_map)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
rwsem_release(lockdep_map)

In the second case, with the lockdep annotation, lockdep would warn about
an ABBA deadlock, while the first case would just deadlock at some point.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d5b81ced 30-Aug-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix API misuse of zone finish waiting

The commit 2ce543f47843 ("btrfs: zoned: wait until zone is finished when
allocation didn't progress") implemented a zone finish waiting mechanism
to the write path of zoned mode. However, using
wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and
wait_var_event() just hangs because no one ever wakes it up once it goes
into sleep.

Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
on fs_info->flags with a proper barrier installed.

Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ced8ecf0 23-Aug-2022 Omar Sandoval <osandov@fb.com>

btrfs: fix space cache corruption and potential double allocations

When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:

1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.

All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.

Specifically, the race is as follows:

1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!

This explains all of our symptoms above:

* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.

The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.

The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.

A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:

1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.

Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b40130b2 26-Jul-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: fix lockdep splat with reloc root extent buffers

We have been hitting the following lockdep splat with btrfs/187 recently

WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30

-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

other info that might help us debug this:

Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);

*** DEADLOCK ***

7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:

dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of

reloc tree -> normal tree

for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.

However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of

normal tree -> reloc tree

which is why lockdep complains.

Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.

Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.

This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of

normal tree -> reloc tree

We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.

With this patch we no longer have the lockdep splat in btrfs/187.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81bd9328 06-Jul-2022 Christoph Hellwig <hch@lst.de>

btrfs: fix repair of compressed extents

Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.

This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio. To do so, a fair
amount of code has to be reshuffled:

a) the lower btrfs_bio now needs a valid csum pointer. The easiest way
to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
the btrfs_bio management of csums. For a compressed_bio that is
split into multiple btrfs_bios this means additional memory
allocations, but the code becomes a lot more regular.
b) checksum verification now runs directly on the lower btrfs_bio instead
of the compressed_bio. This actually nicely simplifies the end I/O
processing.
c) btrfs_repair_one_sector can't just look up the logical address for
the file offset any more, as there is no corresponding relative
offsets that apply to the file offset and the logic address for
compressed extents. Instead require that the saved bvec_iter in the
btrfs_bio is filled out for all read bios and use that, which again
removes a fair amount of code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7959bd44 06-Jul-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove the start argument to check_data_csum and export

Derive the value of start from the btrfs_bio now that ->file_offset is
always valid. Also export and rename the function so it's available
outside of inode.c as we'll need that soon.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ce543f4 08-Jul-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: wait until zone is finished when allocation didn't progress

When the allocated position doesn't progress, we cannot submit IOs to
finish a block group, but there should be ongoing IOs that will finish a
block group. So, in that case, we wait for a zone to be finished and retry
the allocation after that.

Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
indicate we need a zone finish to have proceeded. The flag is set when the
allocator detected it cannot activate a new block group. And, it is cleared
once a zone is finished.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7d7672bc 08-Jul-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: convert count_max_extents() to use fs_info->max_extent_size

If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs release the metadata reservation too much on its
way to write out the data.

Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7b12a62 08-Jul-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size

On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.

The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.

The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2ae7b77 08-Jul-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: revive max_zone_append_bytes

This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without unnecessary ASSERT and
check. The max_zone_append_size will be used as a hint to estimate the
number of extents to cover delalloc/writeback region in the later commits.

The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit considers it to calculate max_zone_append_size. Technically, a
bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
as an upper limit of an extent size to calculate the number of extents
needed to write data.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e55958c8 14-Jun-2022 Ioannis Angelakopoulos <iangelak@fb.com>

btrfs: collect commit stats, count, duration

Track several stats about transaction commit, to be later exported via
sysfs:

- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns

The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 37f85ec3 13-Jun-2022 Qu Wenruo <wqu@suse.com>

btrfs: use named constant for reserved device space

There's a reserved space on each device of size 1MiB that can be used by
bootloaders or to avoid accidental overwrite. Use a symbolic constant
with the explaining comment instead of hard coding the value and
multiple comments.

Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
the primary super block (at offset 64KiB), until then the range could
have been used by mistake. Kernel has been always respecting the 1MiB
range for writes.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 6d92b304 25-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: pass bits by value not by pointer for extent_state helpers

The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.

Signed-off-by: David Sterba <dsterba@suse.com>


# 97f09d55 07-Jun-2022 Qu Wenruo <wqu@suse.com>

btrfs: make btrfs_super_block::log_root_transid deprecated

When using "btrfs inspect-internal dump-super" to inspect an fs with
dirty log, it always shows the log_root_transid as 0:

log_root 30474240
log_root_transid 0 <<<
log_root_level 0

It turns out that, btrfs_super_block::log_root_transid is never really
utilized (even no read for it).

This can date back to the introduction of btrfs into upstream kernel.

In fact, when reading log tree root, we always use
btrfs_super_block::generation + 1 as the expected generation.
So here we're completely safe to mark this member deprecated.

In theory we can easily reuse this member for other purposes, but to be
extra safe, here we follow the leafsize way, by adding "__unused_" for
log_root_transid.
And we can safely remove the accessors, since there is no such callers
from the very beginning.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d7b9416f 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove btrfs_end_io_wq

All reads bio that go through btrfs_map_bio need to be completed in
user context. And read I/Os are the most common and timing critical
in almost any file system workloads.

Embed a work_struct into struct btrfs_bio and use it to complete all
read bios submitted through btrfs_map, using the REQ_META flag to decide
which workqueue they are placed on.

This removes the need for a separate 128 byte allocation (typically
rounded up to 192 bytes by slab) for all reads with a size increase
of 24 bytes for struct btrfs_bio. Future patches will reorganize
struct btrfs_bio to make use of this extra space for writes as well.

(All sizes are based a on typical 64-bit non-debug build)

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# fed8a72d 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: don't use btrfs_bio_wq_end_io for compressed writes

Compressed write bio completion is the only user of btrfs_bio_wq_end_io
for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
here as we only real need user context for the final completion of a
compressed_bio structure, and not every single bio completion.

Add a work_struct to struct compressed_bio instead and use that to call
finish_compressed_bio_write. This allows to remove all handling of
write bios in the btrfs_bio_wq_end_io infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# d34e123d 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: defer I/O completion based on the btrfs_raid_bio

Instead of attaching an extra allocation an indirect call to each
low-level bio issued by the RAID code, add a work_struct to struct
btrfs_raid_bio and only defer the per-rbio completion action. The
per-bio action for all the I/Os are trivial and can be safely done
from interrupt context.

As a nice side effect this also allows sharing the boilerplate code
for the per-bio completions

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# c93104e7 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: split btrfs_submit_data_bio to read and write parts

Split btrfs_submit_data_bio into one helper for reads and one for writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3ea4dc5b 17-Mar-2022 Omar Sandoval <osandov@fb.com>

btrfs: send: send compressed extents with encoded writes

Now that all of the pieces are in place, we can use the ENCODED_WRITE
command to send compressed extents when appropriate.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a89ce08c 22-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: factor out a btrfs_csum_ptr helper

Add a helper to find the csum for a byte offset into the csum buffer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# ae643a74 22-May-2022 Qu Wenruo <wqu@suse.com>

btrfs: introduce a data checksum checking helper

Although we have several data csum verification code, we never have a
function really just to verify checksum for one sector.

Function check_data_csum() do extra work for error reporting, thus it
requires a lot of extra things like file offset, bio_offset etc.

Function btrfs_verify_data_csum() is even worse, it will utilize page
checked flag, which means it can not be utilized for direct IO pages.

Here we introduce a new helper, btrfs_check_sector_csum(), which really
only accept a sector in page, and expected checksum pointer.

We use this function to implement check_data_csum(), and export it for
incoming patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: keep passing the csum array as an arguments, as the callers want
to print it, rename per request]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 143823cf 25-May-2022 David Sterba <dsterba@suse.com>

btrfs: fix typos in comments

Codespell has found a few typos.

Signed-off-by: David Sterba <dsterba@suse.com>


# 088aea3b 15-Jul-2022 David Sterba <dsterba@suse.com>

Revert "btrfs: turn delayed_nodes_tree into an XArray"

This reverts commit 253bf57555e451dec5a7f09dc95d380ce8b10e5b.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 01cd3909 15-Jul-2022 David Sterba <dsterba@suse.com>

Revert "btrfs: turn fs_info member buffer_radix into XArray"

This reverts commit 8ee922689d67b7cfa6acbe2aa1ee76ac72e6fc8a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc7cbcd4 15-Jul-2022 David Sterba <dsterba@suse.com>

Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"

This reverts commit 48b36a602a335c184505346b5b37077840660634.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 983d8209 06-Jun-2022 Filipe Manana <fdmanana@suse.com>

btrfs: add missing inode updates on each iteration when replacing extents

When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.

By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.

So add the missing iversion and mtime/ctime updates.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a3e171a0 05-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: move struct btrfs_dio_private to inode.c

The btrfs_dio_private structure is only used in inode.c, so move the
definition there.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 36e8c622 05-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: add a btrfs_dio_rw wrapper

Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
isolated in inode.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb3a12d9 27-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: rename bio_flags in parameters and switch type

Several functions take parameter bio_flags that was simplified to just
compress type, unify it and change the type accordingly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2fe6a5a1 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: sink parameter is_data to btrfs_set_disk_extent_flags

The parameter has been added in 2009 in the infamous monster commit
5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT
CHANGE)") but not used ever since. We can sink it and allow further
simplifications.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 48b36a60 02-May-2022 Gabriel Niebler <gniebler@suse.com>

btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray

… rename it to simply fs_roots and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand, as
it provides array semantics, and also takes care of locking for us,
further simplifying the code.

Also do some refactoring, esp. where the API change requires largely
rewriting some functions, anyway.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ee92268 21-Apr-2022 Gabriel Niebler <gniebler@suse.com>

btrfs: turn fs_info member buffer_radix into XArray

… named 'extent_buffers'. Also adjust all usages of this object to use
the XArray API, which greatly simplifies the code as it takes care of
locking and is generally easier to use and understand, providing
notionally simpler array semantics.

Also perform some light refactoring.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 253bf575 26-Apr-2022 Gabriel Niebler <gniebler@suse.com>

btrfs: turn delayed_nodes_tree into an XArray

… in the btrfs_root struct and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand,
as it provides array semantics, and also takes care of locking for us,
further simplifying the code.

Also use the opportunity to do some light refactoring.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 385de0ef 17-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: use a normal workqueue for rmw_workers

rmw_workers doesn't need ordered execution or thread disabling threshold
(as the thresh parameter is less than DFT_THRESHOLD).

Just switch to the normal workqueues that use a lot less resources,
especially in the work_struct vs btrfs_work structures.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# be539518 17-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: use normal workqueues for scrub

All three scrub workqueues don't need ordered execution or thread
disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
Just switch to the normal workqueues that use a lot less resources,
especially in the work_struct vs btrfs_work structures.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a31b4a43 17-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: simplify WQ_HIGHPRI handling in struct btrfs_workqueue

Just let the one caller that wants optional WQ_HIGHPRI handling allocate
a separate btrfs_workqueue for that. This allows to rename struct
__btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and
separate allocation for all btrfs_workqueue users and generally simplify
the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ad357938 15-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: do not return errors from submit_bio_hook_t instances

Both btrfs_repair_one_sector and submit_bio_one as the direct caller of
one of the instances ignore errors as they expect the methods themselves
to call ->bi_end_io on error. Remove the unused and dangerous return
value.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7aab8b32 15-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: move btrfs_readpage to extent_io.c

Keep btrfs_readpage next to btrfs_do_readpage and the other address
space operations. This allows to keep submit_one_bio and
struct btrfs_bio_ctrl file local in extent_io.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 16b0c258 13-Apr-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use a read/write lock for protecting the block groups tree

Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.

Read only searches on the tree are very frequent and done when:

1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;

2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;

3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;

4) Decreasing the number of extent reservations in a block group;

5) When allocating extents and freeing reserved extents;

6) Adding and removing free space to the free space tree;

7) When releasing delalloc bytes during ordered extent completion;

8) When relocating a block group;

9) During fitrim, to iterate over the block groups;

10) etc;

Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.

We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.

So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 08dddb29 13-Apr-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use rbtree with leftmost node cached for tracking lowest block group

We keep track of the start offset of the block group with the lowest start
offset at fs_info->first_logical_byte. This requires explicitly updating
that field every time we add, delete or lookup a block group to/from the
red black tree at fs_info->block_group_cache_tree.

Since the block group with the lowest start address happens to always be
the one that is the leftmost node of the tree, we can use a red black tree
that caches the left most node. Then when we need the start address of
that block group, we can just quickly get the leftmost node in the tree
and extract the start offset of that node's block group. This avoids the
need to explicitly keep track of that address in the dedicated member
fs_info->first_logical_byte, and it also allows the next patch in the
series to switch the lock that protects the red black tree from a spin
lock to a read/write lock - without this change it would be tricky
because block group searches also update fs_info->first_logical_byte.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e010b3d 24-Mar-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove the zoned/zone_size union in struct btrfs_fs_info

Reading a value from a different member of a union is not just a great
way to obfuscate code, but also creates an aliasing violation. Switch
btrfs_is_zoned to look at ->zone_size and remove the union.

Note: union was to simplify the detection of zoned filesystem but now
this is wrapped behind btrfs_is_zoned so we can drop the union.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# d4135134 23-Mar-2022 Filipe Manana <fdmanana@suse.com>

btrfs: avoid blocking on space revervation when doing nowait dio writes

When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.

So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.

This is part of a patchset comprised of the following patches:

btrfs: avoid blocking on page locks with nowait dio on compressed range
btrfs: avoid blocking nowait dio when locking file range
btrfs: avoid double nocow check when doing nowait dio writes
btrfs: stop allocating a path when checking if cross reference exists
btrfs: free path at can_nocow_extent() before checking for checksum items
btrfs: release path earlier at can_nocow_extent()
btrfs: avoid blocking when allocating context for nowait dio read/write
btrfs: avoid blocking on space revervation when doing nowait dio writes

The following test was run before and after applying this patchset:

$ cat io-uring-nodatacow-test.sh
#!/bin/bash

DEV=/dev/sdc
MNT=/mnt/sdc

MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"

NUM_JOBS=4
FILE_SIZE=8G
RUN_TIME=300

cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randrw
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF

echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT

fio /tmp/fio-job.ini

umount $MNT

The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.

Result before the patchset:

READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec

Result after the patchset:

READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec

That's about +7.2% throughput for reads and +6.9% for writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1a89f173 23-Mar-2022 Filipe Manana <fdmanana@suse.com>

btrfs: stop allocating a path when checking if cross reference exists

At btrfs_cross_ref_exist() we always allocate a path, but we really don't
need to because all its callers (only 2) already have an allocated path
that is not being used when they call btrfs_cross_ref_exist(). So change
btrfs_cross_ref_exist() to take a path as an argument and update both
its callers to pass in the unused path they have when they call
btrfs_cross_ref_exist().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0a66a31 17-Mar-2022 Jonathan Lassoff <jof@thejof.com>

btrfs: add messages to printk index

In order for end users to quickly react to new issues that come up in
production, it is proving useful to leverage this printk indexing
system. This printk index enables kernel developers to use calls to
printk() with changeable ad-hoc format strings, while still enabling end
users to detect changes and develop a semi-stable interface for
detecting and parsing these messages.

So that detailed Btrfs messages are captured by this printk index, this
patch wraps btrfs_printk and btrfs_handle_fs_error with macros.

Example of the generated list:
https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.com

Signed-off-by: Jonathan Lassoff <jof@thejof.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 63c34cb4 15-Mar-2022 Filipe Manana <fdmanana@suse.com>

btrfs: add and use helper to assert an inode range is clean

We have four different scenarios where we don't expect to find ordered
extents after locking a file range:

1) During plain fallocate;
2) During hole punching;
3) During zero range;
4) During reflinks (both cloning and deduplication).

This is because in all these cases we follow the pattern:

1) Lock the inode's VFS lock in exclusive mode;

2) Lock the inode's i_mmap_lock in exclusive node, to serialize with
mmap writes;

3) Flush delalloc in a file range and wait for all ordered extents
to complete - both done through btrfs_wait_ordered_range();

4) Lock the file range in the inode's io_tree.

So add a helper that asserts that we don't have ordered extents for a
given range. Make the four scenarios listed above use this helper after
locking the respective file range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# caae78e0 14-Mar-2022 Omar Sandoval <osandov@fb.com>

btrfs: move common inode creation code into btrfs_create_new_inode()

All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:

1. It reduces code duplication.

2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().

3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.

4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3538d68dbd 14-Mar-2022 Omar Sandoval <osandov@fb.com>

btrfs: reserve correct number of items for inode creation

The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.

btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a1fd0c35 14-Mar-2022 Omar Sandoval <osandov@fb.com>

btrfs: allocate inode outside of btrfs_new_inode()

Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.

This also removes the comments about Smack checking i_op, which are no
longer true since commit 5d6c31910bc0 ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 62142be3 09-Mar-2022 Gabriel Niebler <gniebler@suse.com>

btrfs: introduce btrfs_for_each_slot iterator macro

There is a common pattern when searching for a key in btrfs:

* Call btrfs_search_slot to find the slot for the key
* Enter an endless loop:
* If the found slot is larger than the no. of items in the current
leaf, check the next leaf
* If it's still not found in the next leaf, terminate the loop
* Otherwise do something with the found key
* Increment the current slot and continue

To reduce code duplication, we can replace this code pattern with an
iterator macro, similar to the existing for_each_X macros found
elsewhere in the kernel. This also makes the code easier to understand
for newcomers by putting a name to the encapsulated functionality.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb12489b 29-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Convert btrfs to read_folio

This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 5f0addf7 18-Apr-2022 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: use dedicated lock for data relocation

Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
writeback of the relocation data inode in
btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
in the following path.

Thread A takes btrfs_inode_lock() and waits for metadata reservation by
e.g, waiting for writeback:

prealloc_file_extent_cluster()
- btrfs_inode_lock(&inode->vfs_inode, 0);
- btrfs_prealloc_file_range()
...
- btrfs_replace_file_extents()
- btrfs_start_transaction
...
- btrfs_reserve_metadata_bytes()

Thread B (e.g, doing a writeback work) needs to wait for the inode lock to
continue writeback process:

do_writepages
- btrfs_writepages
- extent_writpages
- btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
- btrfs_inode_lock()

The deadlock is caused by relying on the vfs_inode's lock. By using it, we
introduced unnecessary exclusion of writeback and
btrfs_prealloc_file_range(). Also, the lock at this point is useless as we
don't have any dirty pages in the inode yet.

Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
writeback.

Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16.x: 869f4cdc73f9: btrfs: zoned: encapsulate inode locking for zoned relocation
CC: stable@vger.kernel.org # 5.16.x
CC: stable@vger.kernel.org # 5.17
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 895586eb 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: Convert from invalidatepage to invalidate_folio

A lot of the underlying infrastructure in btrfs needs to be switched
over to folios, but this at least documents that invalidatepage can't
be passed a tail page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# 7eefae6b 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: pass btrfs_fs_info to btrfs_recover_relocation

We don't need a root here, we just need the btrfs_fs_info, we can just
get the specific roots we need from fs_info.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c067da87 23-Feb-2022 Sweet Tea Dorminy <sweettea-kernel@dorminy.me>

btrfs: add filesystems state details to error messages

When a filesystem goes read-only due to an error, multiple errors tend
to be reported, some of which are knock-on failures. Logging fs_states,
in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the
first error from subsequent messages which may only exist due to an
error state.

Under the new format, most initial errors will look like:
`BTRFS: error (device loop0) in ...`
while subsequent errors will begin with:
`error (device loop0: state E) in ...`

An initial transaction abort error will look like
`error (device loop0: state A) in ...`
and subsequent messages will contain
`(device loop0: state EA) in ...`

In addition to the error states we can also print other states that are
temporary, like remounting, device replace, or indicate a global state
that may affect functionality.

Now implemented:

E - filesystem error detected
A - transaction aborted
L - log tree errors

M - remounting in progress
R - device replace in progress
C - data checksums not verified (mounted with ignoredatacsums)

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7c0c7269 13-Aug-2019 Omar Sandoval <osandov@fb.com>

btrfs: add BTRFS_IOC_ENCODED_WRITE

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1881fba8 09-Oct-2019 Omar Sandoval <osandov@fb.com>

btrfs: add BTRFS_IOC_ENCODED_READ ioctl

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.

This initial implementation simplifies a few things that can be improved
in the future:

- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.

Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:

btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)

we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:

btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached

In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28c9b1e7 18-Nov-2019 Omar Sandoval <osandov@fb.com>

btrfs: support different disk extent size for delalloc

Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e331f6b1 06-Nov-2019 Omar Sandoval <osandov@fb.com>

btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()

btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because:

1) the compressed bio always covers one ordered extent
2) the size of the bio is always less than the size of the ordered
extent

However, the second point will not always be true for encoded writes.

Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a55e65b8 01-Feb-2022 David Sterba <dsterba@suse.com>

btrfs: replace BUILD_BUG_ON by static_assert

The static_assert introduced in 6bab69c65013 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.

The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7238e50 15-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add support for multiple global roots

With extent tree v2 you will be able to create multiple csum, extent,
and free space trees. They will be used based on the block group, which
will now use the block_group_item->chunk_objectid to point to the set of
global roots that it will use. When allocating new block groups we'll
simply mod the gigabyte offset of the block group against the number of
global roots we have and that will be the block groups global id.

>From there we can take the bytenr that we're modifying in the respective
tree, look up the block group and get that block groups corresponding
global root id. From there we can get to the appropriate global root
for that bytenr.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9c54e80d 15-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add code to support the block group root

This code adds the on disk structures for the block group root, which
will hold the block group items for extent tree v2.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2c7d2a23 15-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add definition for EXTENT_TREE_V2

This adds the initial definition of the EXTENT_TREE_V2 incompat feature
flag. This also hides the support behind CONFIG_BTRFS_DEBUG.

THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
DEVELOPER OR A TESTER.

The format is in flux and will be added in stages, any fs will need to
be re-made between updates to the format.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b4be6aef 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not start relocation until in progress drops are done

We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item

WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>

Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.

Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.

Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then
balance can wait until this flag is cleared to start up again.

If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 558732df 13-Feb-2022 Qu Wenruo <wqu@suse.com>

btrfs: reduce extent threshold for autodefrag

There is a big gap between inode_should_defrag() and autodefrag extent
size threshold. For inode_should_defrag() it has a flexible
@small_write value. For compressed extent is 16K, and for non-compressed
extent it's 64K.

However for autodefrag extent size threshold, it's always fixed to the
default value (256K).

This means, the following write sequence will trigger autodefrag to
defrag ranges which didn't trigger autodefrag:

pwrite 0 8k
sync
pwrite 8k 128K
sync

The latter 128K write will also be considered as a defrag target (if
other conditions are met). While only that 8K write is really
triggering autodefrag.

Such behavior can cause extra IO for autodefrag.

Close the gap, by copying the @small_write value into inode_defrag, so
that later autodefrag can use the same @small_write value which
triggered autodefrag.

With the existing transid value, this allows autodefrag really to scan
the ranges which triggered autodefrag.

Although this behavior change is mostly reducing the extent_thresh value
for autodefrag, I believe in the future we should allow users to specify
the autodefrag extent threshold through mount options, but that's an
other problem to consider in the future.

CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 40cdc509 18-Jan-2022 Filipe Manana <fdmanana@suse.com>

btrfs: skip reserved bytes warning on unmount after log cleanup failure

After the recent changes made by commit c2e39305299f01 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.

This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.

In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.

When this happens we get the following traces when unmounting the fs:

[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0

This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.

This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.

The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps tracks of the range of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:

1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold into
more memory until the transaction commits.

How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:

https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/

However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:

https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/

2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;

3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).

Since there's no way to be able to properly cleanup a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers do
the following:

1) When we fail to cleanup a log tree, setup a flag that indicates that
failure;

2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to cleanup their
state, page states, page leaks, etc;

3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
cleanup a log tree. Also ignore only for metadata block groups and the
metadata space_info object.

This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort, is not such a terrible thing and it's completely
harmless, it does not affect the filesystem integrity in any way.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f26c9238 14-Dec-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove reada infrastructure

Currently there is only one user for btrfs metadata readahead, and
that's scrub.

But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).

Despite this, there are some extra problems related to metadata
readahead:

- Duplicated feature with btrfs_path::reada

- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.

- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.

This brings extra chance for bugs to sneak in, while brings
unnecessary complexity.

- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.

Thus here I purpose to remove the metadata readahead mechanism
completely.

[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.

For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.

The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.

Old New Diff
SSD 455.3 466.332 +2.42%
HDD 103.927 98.012 -5.69%

Comprehensive test methodology is in the cover letter of the patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 54f03ab1 03-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_truncate_inode_items to inode-item.c

This is an inode item related manipulation with a few vfs related
adjustments. I'm going to remove the vfs related code from this helper
and simplify it a lot, but I want those changes to be easily seen via
git blame, so move this function now and then the simplification work
can be done.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 26c2c454 03-Dec-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add an inode-item.h

We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# efc0e69c 25-Nov-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: introduce exclusive operation BALANCE_PAUSED state

Current set of exclusive operation states is not sufficient to handle
all practical use cases. In particular there is a need to be able to add
a device to a filesystem that have paused balance. Currently there is no
way to distinguish between a running and a paused balance. Fix this by
introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2
occasions:

1. When a filesystem is mounted with skip_balance and there is an
unfinished balance it will now be into BALANCE_PAUSED instead of
simply BALANCE state.

2. When a running balance is paused.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d96b3424 21-Nov-2021 Filipe Manana <fdmanana@suse.com>

btrfs: make send work with concurrent block group relocation

We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.

The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:

1) For all tree searches, send acquires a read lock on the commit root
semaphore;

2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;

4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.

This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abed4aaa 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: track the csum, extent, and free space trees in a rb tree

In the future we are going to have multiple copies of these trees. To
facilitate this we need a way to lookup the different roots we are
looking for. Handle this by adding a global root rb tree that is
indexed on the root->root_key. Then instead of loading the roots at
mount time with individually targeted keys, simply search the tree_root
for anything with the specific objectid we want. This will make it
straightforward to support both old style and new style file systems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7939dd9f 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: stop accessing ->free_space_root directly

We're going to have multiple free space roots in the future, so adjust
all the users of the free space root to use a helper to access the root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc28b25e 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: stop accessing ->csum_root directly

We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 056c8311 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum root

We have a few places where we skip doing csums if we mounted with one of
the rescue options that ignores bad csum roots. In the future when
there are multiple csum roots it'll be costly to check and see if there
are any missing csum roots, so simply add a flag to indicate the fs
should skip loading csums in case of errors.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 29cbcf40 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: stop accessing ->extent_root directly

When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root. Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper. This will allow us to easily clean up
the remaining direct accesses in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fdfbf020 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: rework async transaction committing

Currently we do this awful thing where we get another ref on a trans
handle, async off that handle and commit the transaction from that work.
Because we do this we have to mess with current->journal_info and the
freeze counting stuff.

We already have an async thing to kick for the transaction commit, the
transaction kthread. Replace this work struct with a flag on the
fs_info to tell the kthread to go ahead and commit even if it's before
our timeout. Then we can drastically simplify the async transaction
commit path.

Note: this can be simplified and functionality based on the pending
operation COMMIT.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0af4769d 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: remove unused BTRFS_FS_BARRIER flag

This is no longer used, the -o nobarrier is handled by
BTRFS_MOUNT_NOBARRIER. Remove the flag.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 54230013 09-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: get rid of root->orphan_cleanup_state

Now that we don't care about the stage of the orphan_cleanup_state,
simply replace it with a bit on ->state to make sure we don't call the
orphan cleanup every time we wander into this root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dc2e724e 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: rename btrfs_item_end_nr to btrfs_item_data_end

The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
the offset of the end of the data the item points to. In fact all of
the helpers that we use btrfs_item_end_nr() use data in their name, like
BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end()
to make it clear what this helper is giving us.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a08663d 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: remove the btrfs_item_end() helper

We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
be collapsed.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3212fa14 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: drop the _nr from the item helpers

Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 74794207 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce item_nr token variant helpers

The last remaining place where we have the pattern of

item = btrfs_item_nr(slot)
<do something with the item>

are the token helpers. Handle this by introducing token helpers that
will do the btrfs_item_nr() work inside of the helper itself, and then
convert all users of the btrfs_item token helpers to the new _nr()
variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 437bd07e 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: make btrfs_file_extent_inline_item_len take a slot

Instead of getting the btrfs_item for this, simply pass in the slot of
the item and then use the btrfs_item_size_nr() helper inside of
btrfs_file_extent_inline_item_len().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c91666b1 21-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add btrfs_set_item_*_nr() helpers

We have the pattern of

item = btrfs_item_nr(slot);
btrfs_set_item_*(leaf, item);

in a bunch of places in our code. Fix this by adding
btrfs_set_item_*_nr() helpers which will do the appropriate work, and
replace those calls with

btrfs_set_item_*_nr(leaf, slot);

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7a163608 13-Dec-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix invalid delayed ref after subvolume creation failure

When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
insert the new root's root item into the root tree, we are freeing the
metadata extent we reserved for the new root to prevent a metadata
extent leak, as we don't abort the transaction at that point (since
there is nothing at that point that is irreversible).

However we allocated the metadata extent for the new root which we are
creating for the new subvolume, so its delayed reference refers to the
ID of this new root. But when we free the metadata extent we pass the
root of the subvolume where the new subvolume is located to
btrfs_free_tree_block() - this is incorrect because this will generate
a delayed reference that refers to the ID of the parent subvolume's root,
and not to ID of the new root.

This results in a failure when running delayed references that leads to
a transaction abort and a trace like the following:

[3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
[3868.739857] Code: 68 0f 85 e6 fb ff (...)
[3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
[3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
[3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
[3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
[3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
[3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
[3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
[3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
[3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3868.757803] Call Trace:
[3868.758281] <TASK>
[3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
[3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
[3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs]
[3868.762069] ? lock_acquired+0x19f/0x420
[3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs]
[3868.763860] ? _raw_spin_unlock+0x29/0x40
[3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
[3868.765870] create_subvol+0x1d8/0x9a0 [btrfs]
[3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs]
[3868.767669] ? preempt_count_add+0x49/0xa0
[3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
[3868.769639] ? _copy_from_user+0x66/0xa0
[3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs]
[3868.772364] ? __slab_free+0x10a/0x360
[3868.773198] ? rcu_read_lock_sched_held+0x12/0x60
[3868.774121] ? lock_release+0x223/0x4a0
[3868.774863] ? lock_acquired+0x19f/0x420
[3868.775634] ? rcu_read_lock_sched_held+0x12/0x60
[3868.776530] ? trace_hardirqs_on+0x1b/0xe0
[3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[3868.778280] ? kmem_cache_free+0x321/0x3c0
[3868.779011] ? __x64_sys_ioctl+0x83/0xb0
[3868.779718] __x64_sys_ioctl+0x83/0xb0
[3868.780387] do_syscall_64+0x3b/0xc0
[3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3868.781953] RIP: 0033:0x7f281c59e957
[3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
[3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
[3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
[3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
[3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
[3868.793765] </TASK>
[3868.794037] irq event stamp: 0
[3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
[3868.799284] ---[ end trace be24c7002fe27747 ]---
[3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
[3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
[3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33
[3868.802863] extent refs 1 gen 1265 flags 2
[3868.803447] ref#0: tree block backref root 1610
(...)
[3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33
[3869.065421] extent refs 1 gen 1268 flags 2
[3869.066115] ref#0: tree block backref root 1689
(...)
[3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0
[3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
[3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry

Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
This requires changing the root argument of btrfs_free_tree_block() from
struct btrfs_root * to a u64, since at this point during the subvolume
creation we have not yet created the struct btrfs_root for the new
subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
else from a struct btrfs_root.

This was triggered by test case generic/475 from fstests.

Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4467af88 25-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove root argument from btrfs_unlink_inode()

The root argument passed to btrfs_unlink_inode() and its callee,
__btrfs_unlink_inode(), always matches the root of the given directory and
the given inode. So remove the argument and make __btrfs_unlink_inode()
use the root of the directory.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38732474 20-Oct-2021 Qu Wenruo <wqu@suse.com>

btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE

It's a common practice to avoid use sizeof(struct btrfs_super_block)
(3531), but to use BTRFS_SUPER_INFO_SIZE (4096).

The problem is that, sizeof(struct btrfs_super_block) doesn't match
BTRFS_SUPER_INFO_SIZE from the very beginning.

Furthermore, for all call sites except selftests, we always allocate
BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason
to use the smaller value, and it doesn't really save any space.

So let's get rid of such confusing behavior, and unify those two values.

This modification also adds a new static_assert() to verify the size,
and moves the BTRFS_SUPER_INFO_* macros to the definition of
btrfs_super_block for the static_assert().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 84961539 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add a BTRFS_FS_ERROR helper

We have a few flags that are inconsistently used to describe the fs in
different states of failure. As of 5963ffcaf383 ("btrfs: always abort
the transaction if we abort a trans handle") we will always set
BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong. Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
use the helper.

The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6aabd858 27-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unused function btrfs_bio_fits_in_stripe()

As the last caller in compression.c has been removed, we don't need that
function anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f0641656 23-Sep-2021 Filipe Manana <fdmanana@suse.com>

btrfs: unexport setup_items_for_insert()

Since setup_items_for_insert() is not used anymore outside of ctree.c,
make it static and remove its prototype from ctree.h. This also requires
to move the definition of setup_item_for_insert() from ctree.h to ctree.c
and move down btrfs_duplicate_item() so that it's defined after
setup_items_for_insert().

Further, since setup_item_for_insert() is used outside ctree.c, rename it
to btrfs_setup_item_for_insert().

This patch is part of a small patchset that is comprised of the following
patches:

btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories

This is patch 2/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b7ef5f3a 23-Sep-2021 Filipe Manana <fdmanana@suse.com>

btrfs: loop only once over data sizes array when inserting an item batch

When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:

1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;

2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;

3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.

That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.

It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.

So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().

This patch is part of a small patchset that is comprised of the following
patches:

btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories

This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c3a3b19b 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename struct btrfs_io_bio to btrfs_bio

Previously we had "struct btrfs_bio", which records IO context for
mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
btrfs specific info for logical bytenr bio.

With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
"btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.

The struct btrfs_bio changes meaning by this commit. There was a
suggested name like btrfs_logical_bio but it's a bit long and we'd
prefer to use a shorter name.

This could be a concern for backports to older kernels where the
different meaning could possibly cause confusion or bugs. Comparing the
new and old structures, there's no overlap among the struct members so a
build would break in case of incorrect backport.

We haven't had many backports to bio code anyway so this is more of a
theoretical cause of bugs and a matter of precaution but we'll need to
keep the semantic change in mind.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2707a25 08-Sep-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: add a dedicated data relocation block group

Relocation in a zoned filesystem can fail with a transaction abort with
error -22 (EINVAL). This happens because the relocation code assumes that
the extents we relocated the data to have the same size the source extents
had and ensures this by preallocating the extents.

But in a zoned filesystem we currently can't preallocate the extents as
this would break the sequential write required rule. Therefore it can
happen that the writeback process kicks in while we're still adding pages
to a delalloc range and starts writing out dirty pages.

This then creates destination extents that are smaller than the source
extents, triggering the following safety check in get_new_location():

1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
1035 ret = -EINVAL;
1036 goto out;
1037 }

Temporarily create a dedicated block group for the relocation process, so
no non-relocation data writes can interfere with the relocation writes.

This is needed that we can switch the relocation process on a zoned
filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.

Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 37f00a6d 08-Sep-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: introduce btrfs_is_data_reloc_root

There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.

Factor out the check into a small helper function instead of open coding
it multiple times.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afba2bc0 19-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: implement active zone tracking

Add zone_is_active flag to btrfs_block_group. This flag indicates the
underlying zones are all active. Such zone active block groups are tracked
by fs_info->active_bg_list.

btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
device part. They set/clear the bitmap to indicate zone activeness and
count the number of zones we can activate left.

btrfs_zone_{activate,finish}() take responsibility for the logical part and
the list management. In addition, btrfs_zone_finish() wait for any writes
on it and send REQ_OP_ZONE_FINISH to the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1ccc2e8a 06-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file()

Currently btrfs_defrag_file() accepts both "struct inode" and "struct
file" as parameter. We can easily grab "struct inode" from "struct
file" using file_inode() helper.

The reason why we need "struct file" is just to re-use its f_ra.

Change this to pass "struct file_ra_state" parameter, so that it's more
clear what we really want. Since we're here, also add some comments on
the function btrfs_defrag_file().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8481dd80 17-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: introduce btrfs_subpage_bitmap_info

Currently we use fixed size u16 bitmap for subpage bitmap. This is fine
for 4K sectorsize with 64K page size.

But for 4K sectorsize and larger page size, the bitmap is too small,
while for smaller page size like 16K, u16 bitmaps waste too much space.

Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to
record the proper bitmap size, and where each bitmap should start at.

By this, we can later compact all subpage bitmaps into one u32 bitmap.
This patch is the first step.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8dcbc261 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: unify lookup return value when dir entry is missing

btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
entries and both are used during log replay or when updating a log tree
during an unlink.

However when the dir item does not exists, btrfs_lookup_dir_item() returns
NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
the dir item exists but there is no matching entry for a given name or
index, both return NULL. This makes the call sites during log replay to
be more verbose than necessary and it makes it easy to miss this slight
difference. Since we don't need to distinguish between those two cases,
make btrfs_lookup_dir_index_item() always return NULL when there is no
matching directory entry - either because there isn't any dir entry or
because there is one but it does not match the given name and index.

Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
'index' since it is supposed to match an index number, and the name
'objectid' is not very good because it can easily be confused with an
inode number (like the inode number a dir entry points to).

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0cad6246 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: add rcu argument to ->get_acl() callback

Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4d4340c9 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls

Creating subvolumes and snapshots is one of the core features of btrfs
and is even available to unprivileged users. Make it possible to use
subvolume and snapshot creation on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0ff40a91 29-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: introduce btrfs_search_backwards function

It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.

Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.

The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 14605409 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: initial fsverity support

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 77eea05e 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: add ro compat flags to inodes

Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:

btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted exactly badly).

Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading zeroes
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 03fe78cc 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc

We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 809d6902 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: make btrfs_next_leaf static inline

btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
to header to avoid the function call overhead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 25c1252a 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered

The uptodate parameter should be bool, change the type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a129ffb8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()

Since commit d75855b4518b ("btrfs: Remove
extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
and adds btrfs_writepage_cow_fixup() function, there is no need to
follow the old hook parameters.

Remove the @start and @end hook, since currently the fixup check is full
page check, it doesn't need @start and @end hook.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a80d1c6 28-Jun-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: remove max_zone_append_size logic

There used to be a patch in the original series for zoned support which
limited the extent size to max_zone_append_size, but this patch has been
dropped somewhere around v9.

We've decided to go the opposite direction, instead of limiting extents
in the first place we split them before submission to comply with the
device's limits.

Remove the related code, btrfs_fs_info::max_zone_append_size and
btrfs_zoned_device_info::max_zone_append_size.

This also removes the workaround for dm-crypt introduced in
1d68128c107a ("btrfs: zoned: fail mount if the device does not support
zone append") because the fix has been merged as f34ee1dce642 ("dm
crypt: Fix zoned block device support").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 629e33a1 22-Jun-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: remove unused btrfs_fs_info::total_pinned

This got added 14 years ago in 324ae4df00fd ("Btrfs: Add block group
pinned accounting back") but it was not ever used. Subsequently its
usage got gradually removed in 8790d502e440 ("Btrfs: Add support for
mirroring across drives") and 11833d66be94 ("Btrfs: improve async block
group caching"). Let's remove it for good!

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c416a30c 22-Jun-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: rip out may_commit_transaction

may_commit_transaction was introduced before the ticketing
infrastructure existed. There was a problem where we'd legitimately be
out of space, but every reservation would trigger a transaction commit
and then fail. Thus if you had 1000 things trying to make a
reservation, they'd all do the flushing loop and thus commit the
transaction 1000 times before they'd get their ENOSPC.

This helper was introduced to short circuit this, if there wasn't space
that could be reclaimed by committing the transaction then simply ENOSPC
out. This made true ENOSPC tests much faster as we didn't waste a bunch
of time.

However many of our bugs over the years have been from cases where we
didn't account for some space that would be reclaimed by committing a
transaction. The delayed refs rsv space, delayed rsv, many pinned bytes
miscalculations, etc. And in the meantime the original problem has been
solved with ticketing. We no longer will commit the transaction 1000
times. Instead we'll get 1000 waiters, we will go through the flushing
mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
out. The ticketing infrastructure gives us a deterministic way to see
if we're making progress or not, thus we avoid a lot of extra work.

So simplify this step by simply unconditionally committing the
transaction. This removes what is arguably our most common source of
early ENOSPC bugs and will allow us to drastically simplify many of the
things we track because we simply won't need them with this stuff gone.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cea5cf0 21-Jun-2021 Filipe Manana <fdmanana@suse.com>

btrfs: ensure relocation never runs while we have send operations running

Relocation and send do not play well together because while send is
running a block group can be relocated, a transaction committed and
the respective disk extents get re-allocated and written to or discarded
while send is about to do something with the extents.

This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures
and crashes due to concurrent relocation"), which prevented balance and
send from running in parallel but it did not address one remaining case
where chunk relocation can happen: shrinking a device (and device deletion
which shrinks a device's size to 0 before deleting the device).

We also have now one more case where relocation is triggered: on zoned
filesystems partially used block groups get relocated by a background
thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically
reclaim zones").

So make sure that instead of preventing balance from running when there
are ongoing send operations, we prevent relocation from happening.
This uses the infrastructure recently added by a patch that has the
subject: "btrfs: add cancellable chunk relocation support".

Also it adds a spinlock used exclusively for the exclusivity between
send and relocation, as before fs_info->balance_mutex was used, which
would make an attempt to run send to block waiting for balance to
finish, which can take a lot of time on large filesystems.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cbeaae4f 18-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: shorten integrity checker extent data mount option

Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
mount option name.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ccd9395b 18-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: switch mount option bits to enums and use wider type

Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
recorded in the debugging information for convenience).

There are two more things done but separating them would not make much
sense as it's touching the same lines:

- Renumber shifts 18..31 to 17..30 to get rid of the hole in the
sequence.

- Use 1UL as the value that gets shifted because we're approaching the
32bit limit and due to integer promotions the value of (1 << 31)
becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
manipulating helpers).

This is not causing any problems yet as the operations are in-memory
and masking the 31st bit works, we don't have more than 31 bits so the
ill effects of not masking higher bits don't happen. But once we have
more, the problems will emerge.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1a9fd417 21-May-2021 David Sterba <dsterba@suse.com>

btrfs: fix typos in comments

Fix typos that have snuck in since the last round. Found by codespell.

Signed-off-by: David Sterba <dsterba@suse.com>


# d2a91064 31-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make btrfs_set_range_writeback() subpage compatible

Function btrfs_set_range_writeback() currently just sets the page
writeback unconditionally.

Change it to call the subpage helper so that we can handle both cases
well.

Since the subpage helpers needs btrfs_fs_info, also change the parameter
to accept btrfs_inode.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f57ad937 07-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename PagePrivate2 to PageOrdered inside btrfs

Inside btrfs we use Private2 page status to indicate we have an ordered
extent with pending IO for the sector.

But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
And with extra comment about the bit added, so reader who is still
uncertain about the page Ordered status, will find the comment pretty
easily.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38a39ac7 08-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()

There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.

Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.

By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.

Also, to cooperate with such modification, replace @page parameter for
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 390ed29b 14-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: refactor submit_extent_page() to make bio and its flag tracing easier

There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".

Such strange parameters are here for bio assembly.

For example, we have such inode page layout:

0 4K 8K 12K
|<-- Extent A-->|<- EB->|

Then what we do is:

- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.

- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.

- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.

This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.

So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.

Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.

This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.

The following functions/structures are affected:

- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)

- end_write_bio()
- flush_write_bio()
Just change how bio is fetched

- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.

- calc_bio_boundaries()
New function

- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.

- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 578bda9e 18-May-2021 David Sterba <dsterba@suse.com>

btrfs: introduce try-lock semantics for exclusive op start

Add try-lock for exclusive operation start to allow callers to do more
checks. The same operation must already be running. The try-lock and
unlock must pair and are a substitute for btrfs_exclop_start, thus it
must also pair with btrfs_exclop_finish to release the exclop context.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 907d2710 17-May-2021 David Sterba <dsterba@suse.com>

btrfs: add cancellable chunk relocation support

Add support code that will allow canceling relocation on the chunk
granularity. This is different and independent of balance, that also
uses relocation but is a higher level operation and manages it's own
state and pause/cancellation requests.

Relocation is used for resize (shrink) and device deletion so this will
be a common point to implement cancellation for both. The context is
entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
enclosing one chunk relocation. The status bit is set and unset between
the chunks. As relocation can take long, the effects may not be
immediate and the request and actual action can slightly race.

The fs_info::reloc_cancel_req is only supposed to be increased and does
not pair with decrement like fs_info::balance_cancel_req.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d7ed32c 14-May-2021 David Sterba <dsterba@suse.com>

btrfs: protect exclusive_operation by super_lock

The exclusive operation is now atomically checked and set using bit
operations. Switch it to protection by spinlock. The super block lock is
not frequently used and adding a new lock seems like an overkill so it
should be safe to reuse it.

The reason to use spinlock is to enhance the locking context so more
checks can be done, eg. allowing the same exclusive operation enter
the exclop section and cancel the running one. This will be used for
resize and device delete.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 49547068 15-Sep-2020 David Sterba <dsterba@suse.com>

btrfs: document byte swap optimization of root_item::flags accessors

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d7d3165 24-May-2021 Filipe Manana <fdmanana@suse.com>

btrfs: don't set the full sync flag when truncation does not touch extents

At btrfs_truncate() where we truncate the inode either to the same size
or to a smaller size, we always set the full sync flag on the inode.

This is needed in case the truncation drops or trims any file extent items
that start beyond or cross the new inode size, so that the next fsync
drops all inode items from the log and scans again the fs/subvolume tree
to find all items that must be logged.

However if the truncation does not drop or trims any file extent items, we
do not need to set the full sync flag and force the next fsync to use the
slow code path. So do not set the full sync flag in such cases.

One use case where it is frequent to do truncations that do not change
the inode size and do not drop any extents (no prealloc extents beyond
i_size) is when running Microsoft's SQL Server inside a Docker container.
One example workload is the one Philipp Fent reported recently, in the
thread with a link below. In this workload a large number of fsyncs are
preceded by such truncate operations.

After this change I constantly get the runtime for that workload from
Philipp to be reduced by about -12%, for example from 184 seconds down
to 162 seconds.

Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 08508fea 02-May-2021 Qu Wenruo <wqu@suse.com>

btrfs: make btrfs_verify_data_csum() to return a bitmap

This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.

As if all csums match, the return value will be 0, same as now.
Only when csum mismatches, the return value is different.

The new return value will be a bitmap, for 4K sectorsize and 4K page
size, it will be either 1, instead of the -EIO (which is not used
directly by the callers, no effective change).

But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f9baa501 21-Apr-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix deadlock when cloning inline extents and using qgroups

There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.

When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").

However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.

When this issue happens, stack traces like the following are reported:

[72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000
[72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
[72747.556271] Call Trace:
[72747.556273] __schedule+0x296/0x760
[72747.556277] schedule+0x3c/0xa0
[72747.556279] io_schedule+0x12/0x40
[72747.556284] __lock_page+0x13c/0x280
[72747.556287] ? generic_file_readonly_mmap+0x70/0x70
[72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs]
[72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160
[72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
[72747.556362] ? update_group_capacity+0x25/0x210
[72747.556366] ? cpumask_next_and+0x1a/0x20
[72747.556391] extent_writepages+0x44/0xa0 [btrfs]
[72747.556394] do_writepages+0x41/0xd0
[72747.556398] __writeback_single_inode+0x39/0x2a0
[72747.556403] writeback_sb_inodes+0x1ea/0x440
[72747.556407] __writeback_inodes_wb+0x5f/0xc0
[72747.556410] wb_writeback+0x235/0x2b0
[72747.556414] ? get_nr_inodes+0x35/0x50
[72747.556417] wb_workfn+0x354/0x490
[72747.556420] ? newidle_balance+0x2c5/0x3e0
[72747.556424] process_one_work+0x1aa/0x340
[72747.556426] worker_thread+0x30/0x390
[72747.556429] ? create_worker+0x1a0/0x1a0
[72747.556432] kthread+0x116/0x130
[72747.556435] ? kthread_park+0x80/0x80
[72747.556438] ret_from_fork+0x1f/0x30

[72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[72747.566961] Call Trace:
[72747.566964] __schedule+0x296/0x760
[72747.566968] ? finish_wait+0x80/0x80
[72747.566970] schedule+0x3c/0xa0
[72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
[72747.566999] ? finish_wait+0x80/0x80
[72747.567024] lock_extent_bits+0x37/0x90 [btrfs]
[72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs]
[72747.567051] ? find_get_pages_range_tag+0x2cd/0x380
[72747.567076] __extent_writepage+0x203/0x320 [btrfs]
[72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[72747.567106] ? update_load_avg+0x7e/0x5f0
[72747.567109] ? enqueue_entity+0xf4/0x6f0
[72747.567134] extent_writepages+0x44/0xa0 [btrfs]
[72747.567137] ? enqueue_task_fair+0x93/0x6f0
[72747.567140] do_writepages+0x41/0xd0
[72747.567144] __filemap_fdatawrite_range+0xc7/0x100
[72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs]
[72747.567200] process_one_work+0x1aa/0x340
[72747.567202] worker_thread+0x30/0x390
[72747.567205] ? create_worker+0x1a0/0x1a0
[72747.567208] kthread+0x116/0x130
[72747.567211] ? kthread_park+0x80/0x80
[72747.567214] ret_from_fork+0x1f/0x30

[72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000
[72747.569689] Call Trace:
[72747.569691] __schedule+0x296/0x760
[72747.569694] schedule+0x3c/0xa0
[72747.569721] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569725] ? finish_wait+0x80/0x80
[72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
[72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
[72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
[72747.569810] ? path_lookupat.isra.48+0x97/0x140
[72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs]
[72747.569836] ? __kmalloc+0x16a/0x2c0
[72747.569839] do_iter_readv_writev+0x160/0x1c0
[72747.569843] do_iter_write+0x80/0x1b0
[72747.569847] vfs_writev+0x84/0x140
[72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs]
[72747.569873] do_writev+0x65/0x100
[72747.569876] do_syscall_64+0x33/0x40
[72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000
[72747.569903] Call Trace:
[72747.569906] __schedule+0x296/0x760
[72747.569909] schedule+0x3c/0xa0
[72747.569936] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569940] ? finish_wait+0x80/0x80
[72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
[72747.569989] start_transaction+0x279/0x580 [btrfs]
[72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs]
[72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs]
[72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs]
[72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs]
[72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
[72747.570126] do_clone_file_range+0xed/0x200
[72747.570131] vfs_clone_file_range+0x37/0x110
[72747.570134] ioctl_file_clone+0x7d/0xb0
[72747.570137] do_vfs_ioctl+0x138/0x630
[72747.570140] __x64_sys_ioctl+0x62/0xc0
[72747.570143] do_syscall_64+0x33/0x40
[72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9

So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.

The special cases for cloning inline extents were added in kernel 5.7 by
by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 97fc2977 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

btrfs: convert to fileattr

Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: David Sterba <dsterba@suse.com>


# 18bb8bbf 19-Apr-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: automatically reclaim zones

When a file gets deleted on a zoned file system, the space freed is not
returned back into the block group's free space, but is migrated to
zone_unusable.

As this zone_unusable space is behind the current write pointer it is not
possible to use it for new allocations. In the current implementation a
zone is reset once all of the block group's space is accounted as zone
unusable.

This behaviour can lead to premature ENOSPC errors on a busy file system.

Instead of only reclaiming the zone once it is completely unusable,
kick off a reclaim job once the amount of unusable bytes exceeds a user
configurable threshold between 51% and 100%. It can be set per mounted
filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
by default.

Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list and then on a transaction commit, the reclaim
process is triggered but after we deleted unused block groups, which will
free space for the relocation process.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3372065 19-Apr-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock

As a preparation for extending the block group deletion use case, rename
the unused_bgs_mutex to reclaim_bgs_lock.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e9306ad4 24-Feb-2021 Qu Wenruo <wqu@suse.com>

btrfs: more graceful errors/warnings on 32bit systems when reaching limits

Btrfs uses internally mapped u64 address space for all its metadata.
Due to the page cache limit on 32bit systems, btrfs can't access
metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K
page size while 256T for 64K page size.

Users can have a filesystem which doesn't have metadata beyond the
boundary at mount time, but later balance can cause it to create
metadata beyond the boundary.

And modification to MM layer is unrealistic just for such minor use
case. We can't do more than to prevent mounting such filesystem or warn
early when the numbers are still within the limits.

To address such problem, this patch will introduce the following checks:

- Mount time rejection
This will reject any fs which has metadata chunk at or beyond the
boundary.

- Mount time early warning
If there is any metadata chunk beyond 5/8th of the boundary, we do an
early warning and hope the end user will see it.

- Runtime extent buffer rejection
If we're going to allocate an extent buffer at or beyond the boundary,
reject such request with EOVERFLOW.
This is definitely going to cause problems like transaction abort, but
we have no better ways.

- Runtime extent buffer early warning
If an extent buffer beyond 5/8th of the max file size is allocated, do
an early warning.

Above error/warning message will only be printed once for each fs to
reduce dmesg flood.

If the mount is rejected, the filesystem will be mountable only on a
64bit host.

Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ace75066 31-Mar-2021 Filipe Manana <fdmanana@suse.com>

btrfs: improve btree readahead for full send operations

Currently a full send operation uses the standard btree readahead when
iterating over the subvolume/snapshot btree, which despite bringing good
performance benefits, it could be improved in a few aspects for use cases
such as full send operations, which are guaranteed to visit every node
and leaf of a btree, in ascending and sequential order. The limitations
of that standard btree readahead implementation are the following:

1) It only triggers readahead for leaves that are physically close
to the leaf being read, within a 64K range;

2) It only triggers readahead for the next or previous leaves if the
leaf being read is not currently in memory;

3) It never triggers readahead for nodes.

So add a new readahead mode that addresses all these points and use it
for full send operations.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit

mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT

# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))

echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done

wait ${worker_pids[@]}

sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}

initial_file_count=500000
add_files $initial_file_count 0 4

echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1

echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4

echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done

echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2

umount $MNT

echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null

mount $MOUNT_OPTIONS $DEV $MNT

echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"

umount $MNT

The durations of the full send operation in seconds were the following:

Before this change: 217 seconds
After this change: 205 seconds (-5.7%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc03f39e 11-Mar-2021 Filipe Manana <fdmanana@suse.com>

btrfs: use a bit to track the existence of tree mod log users

The tree modification log functions are called very frequently, basically
they are called every time a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.

Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f3a84ccd 11-Mar-2021 Filipe Manana <fdmanana@suse.com>

btrfs: move the tree mod log code into its own file

The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.

To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.

This makes it easier to go over the tree modification log code every
time I need to go read it to fix a bug.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>


# cea62800 16-Mar-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: remove duplicated in_range() macro

The in_range() macro is defined twice in btrfs' source, once in ctree.h
and once in misc.h.

Remove the definition in ctree.h and include misc.h in the files depending
on it.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8318ba79 10-Feb-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add a i_mmap_lock to our inode

We need to be able to exclude page_mkwrite from happening concurrently
with certain operations. To facilitate this, add a i_mmap_lock to our
inode, down_read() it in our mkwrite, and add a new ILOCK flag to
indicate that we want to take the i_mmap_lock as well. I used pahole to
check the size of the btrfs_inode, the sizes are as follows

no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)

lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)

We're slightly larger but it doesn't change how many objects we can fit
per page.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e295768 03-Mar-2021 Goldwyn Rodrigues <rgoldwyn@suse.de>

btrfs: remove mirror argument from btrfs_csum_verify_data()

The parameter mirror is not used and does not make sense for checksum
verification of the given bio.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 05947ae1 10-Feb-2021 Anand Jain <anand.jain@oracle.com>

btrfs: unexport btrfs_extent_readonly() and make it static

btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
move it from extent-tree.c to inode.c and declare it as static.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bfc78479 17-Feb-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_replace_file_extents take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 195a49ea 04-Feb-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between writes to swap files and scrub

When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.

However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.

Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.

Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 9d294a68 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: enable to mount ZONED incompat flag

This final patch adds the ZONED incompat flag to the supported flags
and enables to mount ZONED flagged file system.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 40ab3be1 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: extend zoned allocator to use dedicated tree-log block group

This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams. As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0bc09ca1 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: serialize metadata IO

We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cacb2cea 04-Feb-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: check if bio spans across an ordered extent

To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that constructing bio does not span more than an ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 576fa348 09-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: improve preemptive background space flushing

Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.

However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.

When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.

The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f00c42dd 09-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce a FORCE_COMMIT_TRANS flush operation

Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction(). This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 5deb17e1 09-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: track ordered bytes instead of just dio ordered bytes

We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO. The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.

However this is true of all ordered extents. If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents. We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9db4dc24 10-Jan-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_start_delalloc_root's nr argument a long

It's currently u64 which gets instantly translated either to LONG_MAX
(if U64_MAX is passed) or cast to an unsigned long (which is in fact,
wrong because writeback_control::nr_to_write is a signed, long type).

Just convert the function's argument to be long time which obviates the
need to manually convert u64 value to a long. Adjust all call sites
which pass U64_MAX to pass LONG_MAX. Finally ensure that in
shrink_delalloc the u64 is converted to a long without overflowing,
resulting in a negative number.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 69948022 07-Dec-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove new_dirid argument from btrfs_create_subvol_root

It's no longer used. While at it also remove new_dirid in create_subvol
as it's used in a single place and open code it. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b8fad57 07-Dec-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: rename btrfs_root::highest_objectid to free_objectid

This reflects the true purpose of the member as it's being used solely
in context where a new objectid is being allocated. Future changes will
also change the way it's being used to closely follow this semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2f96e402 15-Jan-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: fix possible free space tree corruption with online conversion

While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a0a1db70 14-Dec-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between RO remount and the cleaner task

When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.

The following sequence of steps explains how the race happens.

1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;

2) The cleaner task is currently running delayed iputs for example;

3) A filesystem RO remount operation starts;

4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;

5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:

* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();

* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();

* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();

6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;

7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;

8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.

So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();

9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;

10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.

This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a664971 01-Dec-2020 ethanwu <ethanwu@synology.com>

btrfs: correctly calculate item size used when item key collision happens

Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.

item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.

When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.

And split_leaf return EOVERFLOW from following code:

if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;

In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.

However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:

data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))

the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.

The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly

[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00

Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.

There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.

The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it correct leaf
space check in collision case. This fix will not account
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.

1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks

to make the logic of btrfs_search_slot more clear, we add a flag
search_for_extension in btrfs_path.

This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d45f221 02-Dec-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix deadlock when cloning inline extent and low on free metadata space

When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:

* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;

* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;

* the async reclaim task blocks waiting for the delalloc work to complete;

* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;

* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.

When this happens, traces like the following show up in dmesg/syslog:

[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)

Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).

Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.

This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.

CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6275193e 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs

Refactor btrfs_lookup_bio_sums() by:

- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:

* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset

* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.

Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.

- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length

After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:

for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {

/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}

- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.

All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.

The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 884b07d0 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors

To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.

Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.

This patch introduces two helpers to do this:

- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".

For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.

- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.

For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.

This patch will touch the following parts to cover all extent buffer
accessors:

- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# deb67895 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: calculate inline extent buffer page size based on page size

Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.

For a system using 64K page size, we would really have just one page.

While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.

Calculate the array size based on page size and the node size maximum.

- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page

Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7ffd27e3 01-Dec-2020 Qu Wenruo <wqu@suse.com>

btrfs: pass bio_offset to check_data_csum() directly

Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().

Both parameters are calculated values for csum lookup.

Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.

Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.

Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94846229 18-Nov-2020 Boris Burkov <boris@bur.io>

btrfs: keep sb cache_generation consistent with space_cache

When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.

In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.

We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).

Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 47876f7c 24-Nov-2020 Filipe Manana <fdmanana@suse.com>

btrfs: do not block inode logging for so long during transaction commit

Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.

This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).

Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:

1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers

They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.

So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.

This patch belongs to a patch set that is comprised of the following
patches:

btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit

The following script that uses dbench was used to measure the impact of
the whole patchset:

$ cat test-dbench.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"

echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 300 64

umount $MNT

The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).

Before patch set:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236

Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms

After patch set:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259

Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms

+12.7% throughput, -8.2% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5297199a 26-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove inode number cache feature

It's been deprecated since commit b547a88ea577 ("btrfs: start
deprecation of mount option inode_cache") which enumerates the reasons.

A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps, that data stay on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the change can be implemented
in btrfs-progs if needed.

[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 862931c7 10-Nov-2020 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: introduce max_zone_append_size

The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.

Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b70f5097 10-Nov-2020 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: check and enable ZONED mode

Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 729f7961 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_update_inode_fallback take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b06359a3 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_cont_expand take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 217f42eb 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_truncate_block take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a56fcd1 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_update_inode take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 50743398 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_truncate_inode_items take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 76aea537 02-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2f5239dc 06-Nov-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: remove btrfs_path::recurse

With my async free space cache loading patches ("btrfs: load free space
cache asynchronously") we no longer have a user of path->recurse and can
remove it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2766ff61 04-Nov-2020 Filipe Manana <fdmanana@suse.com>

btrfs: update the number of bytes used by an inode atomically

There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.

In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:

https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

The cases where this can happen are the following:

-> Case 1

If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).

This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.

If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).

Example reproducer:

$ cat reproducer-1.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got

while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT

xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)

# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!

for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done

kill $loop_pid &> /dev/null
wait

umount $DEV

$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 2

If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.

This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.

Example reproducer:

$ cat reproducer-2.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got

while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT

touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)

# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar

kill $loop_pid &> /dev/null
wait $loop_pid
done

umount $DEV

$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 3

Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.

The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.

Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.

Example reproducer:

$ cat reproducer-3.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT

extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))

# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null

expected=$(stat -c %b $MNT/foo)

# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done

umount $DEV

$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

So fix this by:

1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;

2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;

3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.

An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5893dfb9 04-Nov-2020 Filipe Manana <fdmanana@suse.com>

btrfs: refactor btrfs_drop_extents() to make it easier to extend

There are many arguments for __btrfs_drop_extents() and its wrapper
btrfs_drop_extents(), which makes it hard to add more arguments to it and
requires changing every caller. I have added a couple myself back in 2014
commit 1acae57b161e ("Btrfs: faster file extent item replace operations")
and therefore know firsthand that it is a bit cumbersome to add additional
arguments to these functions.

Since I will need to add more arguments in a subsequent bug fix, this
change is preparatory work and adds a data structure that holds all the
arguments, for both input and output, that are passed to this function,
with some comments in the structure's definition mentioning what each
field is and how it relates to other fields.

Callers of this function need only to zero out the content of the
structure and setup only the fields they need. This also removes the
need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
now we have a single function named btrfs_drop_extents() that takes a
pointer to this new data structure (struct btrfs_drop_extents_args).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# df903e5d 04-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

btrfs: don't miss async discards after scheduled work override

If btrfs_discard_schedule_work() is called with override=true, it sets
delay anew regardless how much time is left until the timer should have
fired. If delays are long (that can happen, for example, with low
kbps_limit), they might get constantly overridden without having a
chance to run the discard work.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6e88f116 04-Nov-2020 Pavel Begunkov <asml.silence@gmail.com>

btrfs: discard: store async discard delay as ns not as jiffies

Most delay calculations are done in ns or ms, so store
discard_ctl->delay in ms and convert the final delay to jiffies only at
the end.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 478ef886 21-Oct-2020 Qu Wenruo <wqu@suse.com>

btrfs: make buffer_radix take sector size units

For subpage sector size support, one page can contain multiple tree
blocks. The entries cannot be based on page size and index must be
derived from the sectorsize. No change for page size == sector size.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 27d56e62 23-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: update last_byte_to_unpin in switch_commit_roots

While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario

commit root
+----+----+----+----+----+----+----+
|\\\\| |\\\\|\\\\| |\\\\|\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7

new commit root
+----+----+----+----+----+----+----+
| | | |\\\\| | |\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.

The race can happen like:

1) The caching thread was running using the old commit root when it
found the extent for [2, 3);

2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);

3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).

So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).

In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).

Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.

CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# b9729ce0 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: locking: rip out path->leave_spinning

We no longer distinguish between blocking and spinning, so rip out all
this code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fe5ecbe8 02-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: precalculate checksums per leaf once

btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
candidate for optimizations. After the 64bit division has been replaced
by shift, there's still a calculation done each time the function is
called: checksums per leaf.

As this is a constant value for the entire filesystem lifetime, we
can calculate it once at mount time and reuse. This also allows to
reduce the division to 64bit/32bit as we know the constant will always
fit the 32bit type.

Replace the open-coded rounding up with a macro that internally handles
the 64bit division and as it's now a short function, make it static
inline (slight code increase, slight stack usage reduction).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 22b6331d 02-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: store precalculated csum_size in fs_info

In many places we need the checksum size and it is inefficient to read
it from the raw superblock. Store the value into fs_info, actual use
will be in followup patches. The size is u32 as it allows to generate
better assembly than with u16.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 265fdfa6 01-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits

The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab108d99 01-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: use precalculated sectorsize_bits from fs_info

We do a lot of calculations where we divide or multiply by sectorsize.
We also know and make sure that sectorsize is a power of two, so this
means all divisions can be turned to shifts and avoid eg. expensive
u64/u32 divisions.

The type is u32 as it's more register friendly on x86_64 compared to u8
and the resulting assembly is smaller (movzbl vs movl).

There's also superblock s_blocksize_bits but it's usually one more
pointer dereference farther than fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c8422684 15-Sep-2020 David Sterba <dsterba@suse.com>

btrfs: add set/get accessors for root_item::drop_level

The drop_level member is used directly unlike all the other int types in
root_item. Add the definition and use it everywhere. The type is u8 so
there's no conversion necessary and the helpers are properly inlined,
this is for consistency.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ecfdc08b 24-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: remove dio iomap DSYNC workaround

This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a14b78ad 24-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: introduce btrfs_inode_lock()/unlock()

btrfs_inode_lock/unlock() are wrappers around inode locks, separating
the type of lock and actual locking.

- 0 - default, exclusive lock
- BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO
- BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence

The bits SHARED and TRY can be combined together.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4e4cabec 24-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: split btrfs_direct_IO to read and write

The read and write DIO don't have anything in common except for the
call to iomap_dio_rw. Extract the write call into a new function to get
rid of conditional statements for direct write.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 882dbe0c 16-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce mount option rescue=ignoredatacsums

There are cases where you can end up with bad data csums because of
misbehaving applications. This happens when an application modifies a
buffer in-flight when doing an O_DIRECT write. In order to recover the
file we need a way to turn off data checksums so you can copy the file
off, and then you can delete the file and restore it properly later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42437a63 16-Oct-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce mount option rescue=ignorebadroots

In the face of extent root corruption, or any other core fs wide root
corruption we will fail to mount the file system. This makes recovery
kind of a pain, because you need to fall back to userspace tools to
scrape off data. Instead provide a mechanism to gracefully handle bad
roots, so we can at least mount read-only and possibly recover data from
the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# aa8c1a41 14-Oct-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()

Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as
opposed to calling set_extent_bits again later.

Fold check for written length within the function.

Note: EXTENT_NORESERVE is set before unlocking extents.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a855fbe6 23-Nov-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix lockdep splat when enabling and disabling qgroups

When running test case btrfs/017 from fstests, lockdep reported the
following splat:

[ 1297.067385] ======================================================
[ 1297.067708] WARNING: possible circular locking dependency detected
[ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 1297.068322] ------------------------------------------------------
[ 1297.068629] btrfs/189080 is trying to acquire lock:
[ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.069274]
but task is already holding lock:
[ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.070219]
which lock already depends on the new lock.

[ 1297.071131]
the existing dependency chain (in reverse order) is:
[ 1297.071721]
-> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
[ 1297.072375] lock_acquire+0xd8/0x490
[ 1297.072710] __mutex_lock+0xa3/0xb30
[ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
[ 1297.073421] create_subvol+0x194/0x990 [btrfs]
[ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
[ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
[ 1297.075245] __x64_sys_ioctl+0x83/0xb0
[ 1297.075617] do_syscall_64+0x33/0x80
[ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.076380]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 1297.077166] check_prev_add+0x91/0xc60
[ 1297.077572] __lock_acquire+0x1740/0x3110
[ 1297.077984] lock_acquire+0xd8/0x490
[ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.079789] __x64_sys_ioctl+0x83/0xb0
[ 1297.080232] do_syscall_64+0x33/0x80
[ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.081139]
other info that might help us debug this:

[ 1297.082536] Possible unsafe locking scenario:

[ 1297.083510] CPU0 CPU1
[ 1297.084005] ---- ----
[ 1297.084500] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.084994] lock(sb_internal#2);
[ 1297.085485] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.085974] lock(sb_internal#2);
[ 1297.086454]
*** DEADLOCK ***
[ 1297.087880] 3 locks held by btrfs/189080:
[ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
[ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.089771]
stack backtrace:
[ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1297.092123] Call Trace:
[ 1297.092629] dump_stack+0x8d/0xb5
[ 1297.093115] check_noncircular+0xff/0x110
[ 1297.093596] check_prev_add+0x91/0xc60
[ 1297.094076] ? kvm_clock_read+0x14/0x30
[ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
[ 1297.095029] __lock_acquire+0x1740/0x3110
[ 1297.095510] lock_acquire+0xd8/0x490
[ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.098904] ? do_user_addr_fault+0x20c/0x430
[ 1297.099382] ? kvm_clock_read+0x14/0x30
[ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
[ 1297.100328] ? sched_clock+0x5/0x10
[ 1297.100801] ? sched_clock_cpu+0x12/0x180
[ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
[ 1297.101739] __x64_sys_ioctl+0x83/0xb0
[ 1297.102207] do_syscall_64+0x33/0x80
[ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.103148] RIP: 0033:0x7f773ff65d87

This is because during the quota enable ioctl we lock first the mutex
qgroup_ioctl_lock and then start a transaction, and starting a transaction
acquires a fs freeze semaphore (at the VFS level). However, every other
code path, except for the quota disable ioctl path, we do the opposite:
we start a transaction and then lock the mutex.

So fix this by making the quota enable and disable paths to start the
transaction without having the mutex locked, and then, after starting the
transaction, lock the mutex and check if some other task already enabled
or disabled the quotas, bailing with success if that was the case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e8f147dc 03-Nov-2020 Thomas Gleixner <tglx@linutronix.de>

fs: Remove asm/kmap_types.h includes

Historical leftovers from the time where kmap() had fixed slots.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20201103095856.870272797@linutronix.de


# 66d204a1 12-Oct-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix readahead hang and use-after-free after removing a device

Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;

while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass

if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)

fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()

for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)

$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass

if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)

fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'

return ret

def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)

$ drgn dump_block_groups.py

block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6

block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6

block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6

block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6

block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6

block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6

block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.

It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.

As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().

Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().

At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.

This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 905eb88b 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: remove struct extent_io_ops

It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.

Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 908930f3 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: stop calling submit_bio_hook for data inodes

Instead export and rename the function to btrfs_submit_data_bio and
call it directly in submit_one_bio. This avoids paying the cost for
speculative attacks mitigations and improves code readability.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a446d6a 18-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: replace readpage_end_io_hook with direct calls

Don't call readpage_end_io_hook for the btree inode. Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.

This is an improvement in 2 directions:

1. We aren't paying the penalty of indirect calls in a post-speculation
attacks world.

2. The function is now named more explicitly so it's obvious what's
going on

This is in preparation to removing struct extent_io_ops altogether.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e97659ce 15-Sep-2020 David Sterba <dsterba@suse.com>

btrfs: use unaligned helpers for stack and header set/get helpers

In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
pointer assignment but we should use the helpers for unaligned access
for clarity. It hasn't been a problem so far because of the natural
alignment.

Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
from stack that has an aligned start but some members may not be aligned
due to packing. This as well hasn't caused problems so far.

Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.

Signed-off-by: David Sterba <dsterba@suse.com>


# fc0d82e1 01-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: sink total_data parameter in setup_items_for_insert

That parameter can easily be derived based on the "data_size" and "nr"
parameters exploit this fact to simply the function's signature. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3dc9dc89 01-Sep-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: eliminate total_size parameter from setup_items_for_insert

The value of this argument can be derived from the total_data as it's
simply the value of the data size + size of btrfs_items being touched.
Move the parameter calculation inside the function. This results in a
simpler interface and also a minor size reduction:

./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
Function old new delta
btrfs_duplicate_item 260 259 -1
setup_items_for_insert 1200 1190 -10
btrfs_insert_empty_items 177 154 -23

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 306bfec0 08-Sep-2020 Filipe Manana <fdmanana@suse.com>

btrfs: rename btrfs_punch_hole_range() to a more generic name

The function btrfs_punch_hole_range() is now used to replace all the file
extents in a given file range with an extent described in the given struct
btrfs_replace_extent_info argument. This extent can either be an existing
extent that is being cloned or it can be a new extent (namely a prealloc
extent). When that argument is NULL it only punches a hole (drops all the
existing extents) in the file range.

So rename the function to btrfs_replace_file_extents().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bf385648 08-Sep-2020 Filipe Manana <fdmanana@suse.com>

btrfs: rename struct btrfs_clone_extent_info to a more generic name

Now that we can use btrfs_clone_extent_info to convey information for a
new prealloc extent as well, and not just for existing extents that are
being cloned, rename it to btrfs_replace_extent_info, which reflects the
fact that this is now more generic and it is used to replace all existing
extents in a file range with the extent described by the structure.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb870f6c 08-Sep-2020 Filipe Manana <fdmanana@suse.com>

btrfs: remove item_size member of struct btrfs_clone_extent_info

The value of item_size of struct btrfs_clone_extent_info is always set to
the size of a non-inline file extent item, and in fact the infrastructure
that uses this structure (btrfs_punch_hole_range()) does not work with
inline file extents at all (and it is not supposed to).

So just remove that field from the structure and use directly
sizeof(struct btrfs_file_extent_item) instead. Also assert that the
file extent type is not inline at btrfs_insert_clone_extent().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8fccebfa 08-Sep-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix metadata reservation for fallocate that leads to transaction aborts

When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:

[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left

When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.

When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.

However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:

$ cat reproducer.sh
#!/bin/bash

umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj

# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done

# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync

xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar

umount /mnt/sdj

$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device

$ dmesg
<shows the same stack trace pasted before>

To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.

A test case for fstests will follow up soon.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c3e1f96c 25-Aug-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: enumerate the type of exclusive operation in progress

Instead of using a flag bit for exclusive operation, use a variable to
store which exclusive operation is being performed. Introduce an API
to start and finish an exclusive operation.

This would enable another way for tools to check which operation is
running on why starting an exclusive operation failed. The followup
patch adds a sysfs_notify() to alert userspace when the state changes, so
userspace can perform select() on it to get notified of the change.

This would enable us to enqueue a command which will wait for current
exclusive operation to complete before issuing the next exclusive
operation. This has been done synchronously as opposed to a background
process, or else error collection (if any) will become difficult.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 6fee248d 31-Aug-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: convert btrfs_inode_sectorsize to take btrfs_inode

It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9631e4cc 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks

When we COW a block we are holding a lock on the original block, and
then we lock the new COW block. Because our lockdep maps are based on
root + level, this will make lockdep complain. We need a way to
indicate a subclass for locking the COW'ed block, so plumb through our
btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.

The reason I've added all this extra infrastructure is because there
will be need of different nesting classes in follow up patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 51899412 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce btrfs_path::recurse

Our current tree locking stuff allows us to recurse with read locks if
we're already holding the write lock. This is necessary for the space
cache inode, as we could be holding a lock on the root_tree root when we
need to cache a block group, and thus need to be able to read down the
root_tree to read in the inode cache.

We can get away with this in our current locking, but we won't be able
to with a rwsem. Handle this by purposefully annotating the places
where we require recursion, so that in the future we can maybe come up
with a way to avoid the recursion. In the case of the free space inode,
this will be superseded by the free space tree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e85fde51 24-Jul-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations

[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:

generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)

And with the following metadata leak:

BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents

[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:

btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072

It's pretty obvious that, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().

This leads to the leakage.

[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.

And free the qgroup reserved metadata space when releasing the
block_rsv.

To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.

Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f85781fb 17-Aug-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: switch to iomap for direct IO

We're using direct io implementation based on buffer heads. This patch
switches to the new iomap infrastructure.

Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename
btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
iomap_begin() for iomap direct I/O functions. This function allocates
and locks all the blocks required for the I/O. btrfs_submit_direct() is
used as the submit_io() hook for direct I/O ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore: set it to noop.

Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:

- for writes, call __endio_write_update_ordered()
- for reads, unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

This patch removes last use of struct buffer_head from btrfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 57056740 21-Jul-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: do async reclaim for data reservations

Now that we have the data ticketing stuff in place, move normal data
reservations to use an async reclaim helper to satisfy tickets. Before
we could have multiple tasks race in and both allocate chunks, resulting
in more data chunks than we would necessarily need. Serializing these
allocations and making a single thread responsible for flushing will
only allocate chunks as needed, as well as cut down on transaction
commits and other flush related activities.

Priority reservations will still work as they have before, simply
trying to allocate a chunk until they can make their reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 058e6d1d 21-Jul-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: add flushing states for handling data reservations

Currently the way we do data reservations is by seeing if we have enough
space in our space_info. If we do not and we're a normal inode we'll

1) Attempt to force a chunk allocation until we can't anymore.
2) If that fails we'll flush delalloc, then commit the transaction, then
run the delayed iputs.

If we are a free space inode we're only allowed to force a chunk
allocation. In order to use the normal flushing mechanism we need to
encode this into a flush state array for normal inodes. Since both will
start with allocating chunks until the space info is full there is no
need to add this as a flush state, this will be handled specially.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b4912139 21-Jul-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: change nr to u64 in btrfs_start_delalloc_roots

We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() that takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and it's callers appropriately.

This means we've adjusted start_delalloc_inodes() to take a pointer of
nr since we want to preserve the ability for start-delalloc_inodes() to
return an error, so simply make it do the nr adjusting as necessary.

Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a84d5d42 18-Aug-2020 Boris Burkov <boris@bur.io>

btrfs: detect nocow for swap after snapshot delete

can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.

The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.

Note: Until the snapshot is competely cleaned after deletion,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must until 'btrfs subvolu sync' finished before
activating the swapfile swapon.

CC: stable@vger.kernel.org # 5.4+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 604997b4 27-Jul-2020 David Sterba <dsterba@suse.com>

btrfs: use the correct const function attribute for btrfs_get_num_csums

The build robot reports

compiler: h8300-linux-gcc (GCC) 9.3.0
In file included from fs/btrfs/tests/extent-map-tests.c:8:
>> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2166 | size_t __const btrfs_get_num_csums(void);
| ^~~~~~~

The function attribute for const does not follow the expected scheme and
in this case is confused with a const type qualifier.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f95ebdbe 21-Jul-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: don't WARN if we abort a transaction with EROFS

If we got some sort of corruption via a read and call
btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
complain. If a subsequent trans handle trips over this it'll get EROFS
and then abort. However at that point we're not aborting for the
original reason, we're aborting because we've been flipped read only.
We do not need to WARN_ON() here.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd7fb634 12-Jul-2020 Qu Wenruo <wqu@suse.com>

btrfs: add comments for btrfs_reserve_flush_enum

This enum is the interface exposed to developers.

Although we have a detailed comment explaining the whole idea of space
flushing at the beginning of space-info.c, the exposed enum interface
doesn't have any comment.

Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
not explained at all.

So add some simple comments for these enums as a quick reference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# adca4d94 13-Jul-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT

commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the qgroup free against threshold and tries to wake up commit
kthread to free some space.

The problem of that mechanism is, it can only free qgroup per-trans
metadata space, can't do anything to data, nor prealloc qgroup space.

Now since we have the ability to flush qgroup space, and implemented
retry-after-EDQUOT behavior, such mechanism can be completely replaced.

So this patch will cleanup such mechanism in favor of
retry-after-EDQUOT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c53e9653 13-Jul-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.

btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)

[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.

Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.

For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.

This leads to the -EDQUOT in buffered write routine.

And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.

[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:

- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.

- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.

- Commit transaction
This will free all per-trans metadata space.

Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.

Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 60f8667b 06-Jul-2020 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: add multi-statement protection to btrfs_set/clear_and_info macros

Multi-statement macros should be enclosed in do/while(0) block to make
their use safe in single statement if conditions. All current uses of
the macros are safe, so this change is for future protection.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a93e0168 01-Jul-2020 Filipe Manana <fdmanana@suse.com>

btrfs: remove no longer needed use of log_writers for the log root tree

When syncing the log, we used to update the log root tree without holding
neither the log_mutex of the subvolume root nor the log_mutex of log root
tree.

We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.

The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.

However after commit 06989c799f0481 ("Btrfs: fix race updating log root
item during fsync") the log root tree update was moved into a critical
section delimited by the subvolume's log_mutex. Later another commit
moved the log tree update from that critical section into the second
critical section delimited by the log_mutex of the log root tree. Both
commits addressed different bugs.

The end result is that the first critical section delimited by the
log_mutex of the log root tree became pointless, since there's nothing
done between it and the second critical section, we just have an unlock
of the log_mutex followed by a lock operation. This means we can merge
both critical sections, as the first one does almost nothing now, and we
can stop using the log_writers counter of the log root tree, which was
incremented in the first critical section and decremented in the second
criticial section, used to make sure no one in the second critical section
started writeback of the log root tree before some other task updated it.

So just remove the mutex_unlock() followed by mutex_lock() of the log root
tree, as well as the use of the log_writers counter for the log root tree.

This patch is part of a series that has the following patches:

1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree

After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:

mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28a95795 01-Jul-2020 Filipe Manana <fdmanana@suse.com>

btrfs: stop incremening log_batch for the log root tree when syncing log

We are incrementing the log_batch atomic counter of the root log tree but
we never use that counter, it's used only for the log trees of subvolume
roots. We started doing it when we moved the log_batch and log_write
counters from the global, per fs, btrfs_fs_info structure, into the
btrfs_root structure in commit 7237f1833601dc ("Btrfs: fix tree logs
parallel sync").

So just stop doing it for the log root tree and add a comment over the
field declaration so inform it's used only for log trees of subvolume
roots.

This patch is part of a series that has the following patches:

1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree

After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:

mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 49e5fb46 27-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: export qgroups in sysfs

This patch will add the following sysfs interface:

/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags

Which is also available in output of "btrfs qgroup show".

/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc

The last 3 rsv related members are not visible to users, but can be very
useful to debug qgroup limit related bugs.

Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
level and qgroup id is changed to '_'.

The interface is not hidden behind 'debug' as we want this interface to
be included into production build and to provide another way to read the
qgroup information besides the ioctls.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 088545f6 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_dirty_pages take btrfs_inode

There is a single use of the generic vfs_inode so let's take btrfs_inode
as a parameter and remove couple of redundant BTRFS_I() calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2566f22 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_set_extent_delalloc take btrfs_inode

Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 98456b9c 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_run_delalloc_range take btrfs_inode

All children now take btrfs_inode so convert it to taking it as a
parameter as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 38d37aa9 23-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: refactor btrfs_check_can_nocow() into two variants

The function btrfs_check_can_nocow() now has two completely different
call patterns.

For nowait variant, callers don't need to do any cleanup. While for
wait variant, callers need to release the lock if they can do nocow
write.

This is somehow confusing, and is already a problem for the exported
btrfs_check_can_nocow().

So this patch will separate the different patterns into different
functions.
For nowait variant, the function will be called check_nocow_nolock().
For wait variant, the function pair will be btrfs_check_nocow_lock()
btrfs_check_nocow_unlock().

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6d4572a9 23-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation

[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.

The following script can reproduce it pretty easily:
#!/bin/bash

dev=/dev/test/test
mnt=/mnt/btrfs

umount $dev &> /dev/null
umount $mnt &> /dev/null

mkfs.btrfs -f $dev -b 1G
mount -o nospace_cache $dev $mnt
touch $mnt/foobar
chattr +C $mnt/foobar

xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
sync

xfs_io -c "fpunch 0 2k" $mnt/foobar
umount $mnt

Currently this will fail at the fpunch part.

[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.

Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().

[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.

Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.

This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.

Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a2570ef3 23-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: remove unused btrfs_root::defrag_trans_start

Last touched in 2013 by commit de78b51a2852 ("btrfs: remove cache only
arguments from defrag path") that was the only code that used the value.
Now it's only set but never used for anything, so we can remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 906c448c 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make __btrfs_drop_extents take btrfs_inode

It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
interface with the non __ prefixed version. Will also makes converting
its callers to btrfs_inode easier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd242a08 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_csum_one_bio takae btrfs_inode

Will enable converting btrfs_submit_compressed_write to btrfs_inode more
easily.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7bfa9535 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_reloc_clone_csums take btrfs_inode

It really wants btrfs_inode and not a vfs inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ce6ef5ab 08-Jun-2020 David Sterba <dsterba@suse.com>

btrfs: add little-endian optimized key helpers

The CPU and on-disk keys are mapped to two different structures because
of the endianness. There's an intermediate buffer used to do the
conversion, but this is not necessary when CPU and on-disk endianness
match.

Add optimized versions of helpers that take disk_key and use the buffer
directly for CPU keys or drop the intermediate buffer and conversion.

This saves a lot of stack space accross many functions and removes about
6K of generated binary code:

text data bss dec hex filename
1090439 17468 14912 1122819 112203 pre/btrfs.ko
1084613 17456 14912 1116981 110b35 post/btrfs.ko

Delta: -5826

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 203f44c5 09-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: inode: refactor the parameters of insert_reserved_file_extent()

Function insert_reserved_file_extent() takes a long list of parameters,
which are all for btrfs_file_extent_item, even including two reserved
members, encryption and other_encoding.

This makes the parameter list unnecessary long for a function which only
gets called twice.

This patch will refactor the parameter list, by using
btrfs_file_extent_item as parameter directly to hugely reduce the number
of parameters.

Also, since there are only two callers, one in btrfs_finish_ordered_io()
which inserts file extent for ordered extent, and one
__btrfs_prealloc_file_range().

These two call sites have completely different context, where ordered
extent can be compressed, but will always be regular extent, while the
preallocated one is never going to be compressed and always has PREALLOC
type.

So use two small wrapper for these two different call sites to improve
readability.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e7a79811 15-Jun-2020 Filipe Manana <fdmanana@suse.com>

btrfs: check if a log root exists before locking the log_mutex on unlink

This brings back an optimization that commit e678934cbe5f02 ("btrfs:
Remove unnecessary check from join_running_log_trans") removed, but in
a different form. So it's almost equivalent to a revert.

That commit removed an optimization where we avoid locking a root's
log_mutex when there is no log tree created in the current transaction.
The affected code path is triggered through unlink operations.

That commit was based on the assumption that the optimization was not
necessary because we used to have the following checks when the patch
was authored:

int btrfs_del_dir_entries_in_log(...)
{
(...)
if (dir->logged_trans < trans->transid)
return 0;

ret = join_running_log_trans(root);
(...)
}

int btrfs_del_inode_ref_in_log(...)
{
(...)
if (inode->logged_trans < trans->transid)
return 0;

ret = join_running_log_trans(root);
(...)
}

However before that patch was merged, another patch was merged first which
replaced those checks because they were buggy.

That other patch corresponds to commit 803f0f64d17769 ("Btrfs: fix fsync
not persisting dentry deletions due to inode evictions"). The assumption
that if the logged_trans field of an inode had a smaller value then the
current transaction's generation (transid) meant that the inode was not
logged in the current transaction was only correct if the inode was not
evicted and reloaded in the current transaction. So the corresponding bug
fix changed those checks and replaced them with the following helper
function:

static bool inode_logged(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
{
if (inode->logged_trans == trans->transid)
return true;

if (inode->last_trans == trans->transid &&
test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
!test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
return true;

return false;
}

So if we have a subvolume without a log tree in the current transaction
(because we had no fsyncs), every time we unlink an inode we can end up
trying to lock the log_mutex of the root through join_running_log_trans()
twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
and once for the parent directory (with btrfs_del_dir_entries_in_log()).

This means if we have several unlink operations happening in parallel for
inodes in the same subvolume, and the those inodes and/or their parent
inode were changed in the current transaction, we end up having a lot of
contention on the log_mutex.

The test robots from intel reported a -30.7% performance regression for
a REAIM test after commit e678934cbe5f02 ("btrfs: Remove unnecessary check
from join_running_log_trans").

So just bring back the optimization to join_running_log_trans() where we
check first if a log root exists before trying to lock the log_mutex. This
is done by checking for a bit that is set on the root when a log tree is
created and removed when a log tree is freed (at transaction commit time).

Commit e678934cbe5f02 ("btrfs: Remove unnecessary check from
join_running_log_trans") was merged in the 5.4 merge window while commit
803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to
inode evictions") was merged in the 5.3 merge window. But the first
commit was actually authored before the second commit (May 23 2019 vs
June 19 2019).

Reported-by: kernel test robot <rong.a.chen@intel.com>
Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
Fixes: e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55e20bd1 09-Jun-2020 David Sterba <dsterba@suse.com>

Revert "btrfs: switch to iomap_dio_rw() for dio"

This reverts commit a43a67a2d715540c1368b9501a22b0373b5874c0.

This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.

The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.

Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.

There have been discussions how to fix that, but revert seems to be the
least intrusive option.

Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <dsterba@suse.com>


# f4c48b44 09-Jun-2020 David Sterba <dsterba@suse.com>

Revert "btrfs: split btrfs_direct_IO to read and write part"

This reverts commit d8f3e73587ce574f7a9bc165e0db69b0b148f6f8.

The patch is a cleanup of direct IO port to iomap infrastructure,
which gets reverted.

Signed-off-by: David Sterba <dsterba@suse.com>


# d8f3e735 19-May-2020 Christoph Hellwig <hch@lst.de>

btrfs: split btrfs_direct_IO to read and write part

The read and write versions don't have anything in common except for the
call to iomap_dio_rw. So split this function, and merge each half into
its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a43a67a2 19-May-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: switch to iomap_dio_rw() for dio

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:
- for writes, call __endio_write_update_ordered()
- for reads, unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

This patch removes last use of struct buffer_head from btrfs.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e289f03e 17-May-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents

When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.

For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:

1) Task A starts an fsync on inode A;

2) Task B starts an fsync on inode B;

3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;

4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());

5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;

6) Task B releases the path;

7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;

8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;

9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.

Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.

So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.

Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:

27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"

Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.

This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:

BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---

In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.

Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:

BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0202e83f 15-May-2020 David Sterba <dsterba@suse.com>

btrfs: simplify iget helpers

The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.

Simplify the helpers and pass only inode number, renaming it to 'ino'
instead of 'objectid'. This allows to remove temporary variables key,
saving some stack space.

Signed-off-by: David Sterba <dsterba@suse.com>


# aeb935a4 15-May-2020 Qu Wenruo <wqu@suse.com>

btrfs: don't set SHAREABLE flag for data reloc tree

SHAREABLE flag is set for subvolumes because users can create snapshot
for subvolumes, thus sharing tree blocks of them.

But data reloc tree is not exposed to user space, as it's only an
internal tree for data relocation, thus it doesn't need the full path
replacement handling at all.

This patch will make data reloc tree a non-shareable tree, and add
btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
can grab it from fs_info directly.

This would slightly improve tree relocation, as now data reloc tree
can go through regular COW routine to get relocated, without bothering
the complex tree reloc tree routine.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 92a7cc42 15-May-2020 Qu Wenruo <wqu@suse.com>

btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE

The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.

In fact, that bit can only be set to those trees:

- Subvolume roots
- Data reloc root
- Reloc roots for above roots

All other trees won't get this bit set. So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees. Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).

This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b48966a 28-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: constify extent_buffer in the API functions

There are many helpers around extent buffers, found in extent_io.h and
ctree.h. Most of them can be converted to take constified eb as there
are no changes to the extent buffer structure itself but rather the
pages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 870b388d 29-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: preset set/get token with first page and drop condition

All the set/get helpers first check if the token contains a cached
address. After first use the address is always valid, but the extra
check is done for each call.

The token initialization can optimistically set it to the first extent
buffer page, that we know always exists. Then the condition in all
btrfs_token_*/btrfs_set_token_* can be simplified by removing the
address check from the condition, but for development the assertion
still makes sure it's valid.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cc4c13d5 28-Apr-2020 David Sterba <dsterba@suse.com>

btrfs: drop eb parameter from set/get token helpers

Now that all set/get helpers use the eb from the token, we don't need to
pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
stack space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 684b752b 08-May-2020 Filipe Manana <fdmanana@suse.com>

btrfs: move the block group freeze/unfreeze helpers into block-group.c

The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.

At the time they were added to free-space-cache.c, by commit e33e17ee1098
("btrfs: add missing discards when unpinning extents with -o discard")
because all the trimming related functions were in free-space-cache.c.

Now that the helpers were renamed and are used in scrub context as well,
move them to block-group.c, a much more logical location for them.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b7304af 08-May-2020 Filipe Manana <fdmanana@suse.com>

btrfs: rename member 'trimming' of block group to a more generic name

Back in 2014, commit 04216820fe83d5 ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress for a block group, so that if another task
deletes the block group and then another task allocates a new block group
that gets the same logical address and device extents while the trimming
task is still in progress.

After the previous fix for scrub (patch "btrfs: fix a race between scrub
and block group removal/allocation"), scrub now also has the same needs that
trimming has, so the member name 'trimming' no longer makes sense.
Since there is already a 'pinned' member in the block group that refers
to space reservations (pinned bytes), rename the member to 'frozen',
add a comment on top of it to describe its general purpose and rename
the helpers to increment and decrement the counter as well, to match
the new member name.

The next patch in the series will move the helpers into a more suitable
file (from free-space-cache.c to block-group.c).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 31344b2f 11-May-2020 David Sterba <dsterba@suse.com>

btrfs: remove more obsolete v0 extent ref declarations

The extent references v0 have been superseded long time go, there are
some unused declarations of access helpers. We can safely remove them
now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
btrfs_extent_item_v0 is still part of a backward compatibility check in
relocation.c and thus not removed.

Signed-off-by: David Sterba <dsterba@suse.com>


# 943aeb0d 09-May-2020 YueHaibing <yuehaibing@huawei.com>

btrfs: remove unused function btrfs_dev_extent_chunk_tree_uuid

There's no callers in-tree anymore since
commit d24ee97b96db ("btrfs: use new helpers to set uuids in eb")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c047a69 16-Apr-2020 Omar Sandoval <osandov@fb.com>

btrfs: get rid of endio_repair_workers

This was originally added in commit 8b110e393c5a ("Btrfs: implement
repair function when direct read fails") to avoid a deadlock. In that
commit, the direct I/O read endio executes on the endio_workers
workqueue, submits a repair bio, and waits for it to complete. The
repair bio endio must execute on a different workqueue, otherwise it
could block on the endio_workers workqueue becoming available, which
won't happen because the original endio is blocked on the repair bio.

As of the previous commit, the original endio doesn't wait for the
repair bio, so this separate workqueue is unnecessary.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e3b83361 17-Apr-2020 Qu Wenruo <wqu@suse.com>

btrfs: remove the redundant parameter level in btrfs_bin_search()

All callers pass the eb::level so we can get read it directly inside the
btrfs_bin_search and key_search.

This is inspired by the work of Marek in U-boot.

CC: Marek Behun <marek.behun@nic.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7f9fe614 13-Mar-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: improve global reserve stealing logic

For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start an
ordinary transaction and if it fails it will fall back to reserving the
required amount by stealing from the global reserve. This is problematic
because of all the same reasons we had with previous iterations of the
ENOSPC handling, thundering herd. We get a bunch of failures all at
once, everybody tries to allocate from the global reserve, some win and
some lose, we get an ENSOPC.

Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure. We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve. If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 55465730 02-Mar-2020 Qu Wenruo <wqu@suse.com>

btrfs: backref: rename and move should_ignore_root()

This function is mostly single purpose to relocation backref cache, but
since we're moving the main part of backref cache to backref.c, we need
to export such function.

And to avoid confusion, rename the function to
btrfs_should_ignore_reloc_root() make the name a little more clear.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2433bea5 05-Mar-2020 Qu Wenruo <wqu@suse.com>

btrfs: reloc: make reloc root search-specific for relocation backref cache

find_reloc_root() searches reloc_control::reloc_root_tree to find the
reloc root. This behavior is only useful for relocation backref cache.

For the incoming more generic purpose backref cache, we don't care
about who owns the reloc root, but only care if it's a reloc root.

So this patch makes the following modifications to make the reloc root
search more specific to relocation backref:

- Add backref_node::is_reloc_root
This will be an extra indicator for generic purposed backref cache.
User doesn't need to read root key from backref_node::root to
determine if it's a reloc root.
Also for reloc tree root, it's useless and will be queued to useless
list.

- Add backref_cache::is_reloc
This will allow backref cache code to do different behavior for
generic purpose backref cache and relocation backref cache.

- Pass fs_info to find_reloc_root()

- Export find_reloc_root()
So backref.c can utilize this function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c75e8394 14-Feb-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: kill the subvol_srcu

Now that we have proper root ref counting everywhere we can kill the
subvol_srcu.

* removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes

* the refcount_t used for the references checks for accidental 0->1
in cases where the root lifetime would not be properly protected

* there's a leak detector for roots to catch unfreed roots at umount
time

* SRCU served us well over the years but is was not a proper
synchronization mechanism for some cases

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 3fd63727 14-Feb-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: make the extent buffer leak check per fs info

I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a5eeb3d1 08-Mar-2020 Filipe Manana <fdmanana@suse.com>

btrfs: add helper to get the end offset of a file extent item

Getting the end offset for a file extent item requires a bit of code since
the extent can be either inline or regular/prealloc. There are some places
all over the code base that open code this logic and in another patch
later in this series it will be needed again. Therefore encapsulate this
logic in a helper function and use it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0078a9f9 10-Mar-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove block_rsv parameter from btrfs_drop_snapshot

It's no longer used following 30d40577e322 ("btrfs: reloc: Also queue
orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 726a3421 16-Feb-2020 Qu Wenruo <wqu@suse.com>

btrfs: relocation: add error injection points for cancelling balance

Introduce a new error injection point, should_cancel_balance().

It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
allows us to override the return value.

Currently there are only one locations using this function:

- btrfs_balance()
It checks cancel before each block group.

There are other locations checking fs_info->balance_cancel_req, but they
are not used as an indicator to exit, so there is no need to use the
wrapper.

But there will be more locations coming, and some locations can cause
kernel panic if not handled properly. So introduce this error injection
to provide better test interface.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6a177381 28-Feb-2020 Filipe Manana <fdmanana@suse.com>

Btrfs: move all reflink implementation code into its own file

The reflink code is quite large and has been living in ioctl.c since ever.
It has grown over the years after many bug fixes and improvements, and
since I'm planning on making some further improvements on it, it's time
to get it better organized by moving into its own file, reflink.c
(similar to what xfs does for example).

This change only moves the code out of ioctl.c into the new file, it
doesn't do any other change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42c9d0b5 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: simplify parameters of btrfs_set_disk_extent_flags

All callers pass extent buffer start and length so the extent buffer
itself should work fine.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4ac7541 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: open code trivial helper btrfs_header_chunk_tree_uuid

The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something compeletly different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9a8658e3 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: open code trivial helper btrfs_header_fsid

The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something compeletly different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dcc3eb96 30-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: convert snapshot/nocow exlcusion to drew lock

This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2992df73 30-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Implement DREW lock

A (D)ouble (R)eader (W)riter (E)xclustion lock is a locking primitive
that allows to have multiple readers or multiple writers but not
multiple readers and writers holding it concurrently.

The code is factored out from the existing open-coded locking scheme
used to exclude pending snapshots from nocow writers and vice-versa.
Current implementation actually favors Readers (that is snapshot
creaters) to writers (nocow writers of the filesystem).

The API provides lock/unlock/trylock for reads and writes.

Formal specification for TLA+ provided by Valentin Schneider is at
https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c0c907a4 21-Feb-2020 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: export helpers for subvolume name/id resolution

The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ split from the next patch ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 560b7a4a 18-Feb-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate

btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fe119a6e 20-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: switch to per-transaction pinned extents

This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.

Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9fce5704 20-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_pin_extent_for_log_replay take transaction handle

Preparation for refactoring pinned extents tracking.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7bfc1007 20-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_pin_reserved_extent take transaction handle

btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b25c36f8 20-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_pin_extent take trans handle

Preparation for switching pinned extent tracking to a per-transaction
basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd647ce3 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: add a leak check for roots

Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.

Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d4b0463 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: export and rename free_fs_info

We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 41a2ee75 17-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce per-inode file extent tree

In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.

Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7227ff4d 21-Jan-2020 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between adding and putting tree mod seq elements and nodes

There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problems happens:

1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.

This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:

[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 68c467cb 16-Dec-2019 David Sterba <dsterba@suse.com>

btrfs: separate definition of assertion failure handlers

There's a report where objtool detects unreachable instructions, eg.:

fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction

This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.

To avoid that, use the ifdefs directly.

There are still 2 reports that aren't fixed:

fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9ddf648f 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: keep track of discard reuse stats

Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7fe6d45e 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: have multiple discard lists

Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 19b2a2c7 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: make max async discard size tunable

Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e93591bb 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: add kbps discard rate limit for async discard

Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a2309300 02-Jan-2020 Dennis Zhou <dennis@kernel.org>

btrfs: calculate discard delay based on number of extents

An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.

There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5dc7c10b 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: keep track of discardable_bytes for async discard

Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# dfb79ddb 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: track discardable extents for async discard

The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.

The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e4faab84 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: sysfs: add UUID/debug/discard directory

Setup base sysfs directory for discard stats + tunables.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 93945cb4 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: sysfs: make UUID/debug have its own kobject

Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.

This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6e80d4f8 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: handle empty block_group removal for async discard

block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.

Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.

The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.

The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b0643e59 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: add the beginning of async discard, discard workqueue

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701

discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 46b27f50 13-Dec-2019 Dennis Zhou <dennis@kernel.org>

btrfs: rename DISCARD mount option to to DISCARD_SYNC

This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 39b07b5d 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: drop create parameter to btrfs_get_extent()

We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.

While we're here, let's document btrfs_get_extent().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# db72e47f 10-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums()

We can encode this in the offset parameter: -1 means use the page
offsets, anything else is a valid offset.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e62958fc 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers

Currently, we have two wrappers for __btrfs_lookup_bio_sums():
btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and
btrfs_lookup_bio_sums(), which is used everywhere else. The only
difference is that the _dio variant looks up csums starting at the given
offset instead of using the page index, which isn't actually direct
I/O-specific. Let's clean up the signature and return value of
__btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get
rid of the trivial helpers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a0fbf736 21-Nov-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent

__btrfs_free_reserved_extent now performs the actions of
btrfs_free_and_pin_reserved_extent. But this name is a bit of a
misnomer, since the extent is not really freed but just pinned. Reflect
this in the new name. No semantics changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 40e046ac 05-Dec-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix missing data checksums after replaying a log tree

When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.

Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:

[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb

[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb

[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb

When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.

Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:

(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536

The first one covers the range from the second one, they overlap.

So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.

However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:

[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb

[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb

[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb

[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb

After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:

1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();

2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);

3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;

4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.

5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.

The following steps reproduce this scenario:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc

$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar

$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar

<power failure>

$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error

$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1

Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.

This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32da5386 29-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cfbb825c 10-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: add incompat for raid1 with 3, 4 copies

The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8d6fac00 02-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: add support for 4-copy replication (raid1c4)

Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.

Signed-off-by: David Sterba <dsterba@suse.com>


# 47e6f742 02-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: add support for 3-copy replication (raid1c3)

Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0222dfdd 23-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename extent buffer block group item accessors

Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# de0dc456 23-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename block_group_item on-stack accessors to follow naming

All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.

Signed-off-by: David Sterba <dsterba@suse.com>


# b4e967be 08-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: add member for a specific checksum driver

Currently all the checksum algorithms generate a fixed size digest size
and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.

Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.

The only place that uses the driver name is the crypto API setup.

Signed-off-by: David Sterba <dsterba@suse.com>


# f7cea56c 07-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: sysfs: export supported checksums

Export supported checksum algorithms via sysfs in the list of static
features:

/sys/fs/btrfs/features/supported_checksums

Space spearated list of checksum algorithm names.

Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba8a9d07 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: delete the entire async bio submission framework

Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1f60a65 01-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: add __pure attribute to functions

The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .

The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4143cb8b 01-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: add const function attribute

For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c66e0d4 03-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: drop unused parameter is_new from btrfs_iget

The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1f95ec01 24-Sep-2019 David Sterba <dsterba@suse.com>

btrfs: move btrfs_unlock_up_safe to other locking functions

The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9c7d3a54 23-Sep-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent_io_tree defs to their own header

extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8702ba93 14-Oct-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()

[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.

The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.

The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()

- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.

E.g. for 64K buffered write, the metadata rsv works like:

/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */

/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */

/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().

The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e36a ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.

- The trans commit in advance for qgroup would hide the bug
Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 80ed4548 12-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: don't needlessly create extent-refs kernel thread

The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.

Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af024ed2 30-Aug-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: create structure to encode checksum type and length

Create a structure to encode the type and length for the known on-disk
checksums. This makes it easier to add new checksums later.

The structure and helpers are moved from ctree.h so they don't occupy
space in all headers including ctree.h. This save some space in the
final object.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c82f823c 09-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: tie extent buffer and it's token together

Further simplifaction of the get/set helpers is possible when the token
is uniquely tied to an extent buffer. A condition and an assignment can
be avoided.

The initializations are moved closer to the first use when the extent
buffer is valid. There's one exception in __push_leaf_left where the
token is reused.

Signed-off-by: David Sterba <dsterba@suse.com>


# cb495113 09-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: define separate btrfs_set/get_XX helpers

There are helpers for all type widths defined via macro and optionally
can use a token which is a cached pointer to avoid repeated mapping of
the extent buffer.

The token value is known at compile time, when it's valid it's always
address of a local variable, otherwise it's NULL passed by the
token-less helpers.

This can be utilized to remove some branching as the helpers are used
frequenlty.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6ff49c6a 27-Aug-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref

btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_extref or NULL it it didn't find anything. This
streamlines the function calling convention.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9bb8407f 27-Aug-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct

btrfs_find_name_in_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_ref or NULL it it didn't find anything. This
streamlines the function calling convention.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1dc990df 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move dev_stats helpers to volumes.c

The other dev stats functions are already there and the helpers are not
used by anything else.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 67b61aef 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move struct io_ctl to free-space-cache.h

The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 18d0f5c6 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move functions for tree compare to send.c

Send is the only user of tree_compare, we can move it there along with
the other helpers and definitions.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4b231ae4 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: rename and export read_node_slot

Preparatory work for code that will be moved out of ctree and uses this
function.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8a953348 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move private raid56 definitions from ctree.h

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 602cbe91 21-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move cond_wake_up functions out of ctree

The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3acd4850 21-Aug-2019 Christophe Leroy <christophe.leroy@c-s.fr>

btrfs: fix allocation of free space cache v1 bitmap pages

Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, it has been made a relation with the use of BTRFS
filesystem and with SLUB_DEBUG turned on.

[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten

[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38

Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
have the size of a page, they were allocated with kzalloc().

Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return unaligned pointer.

On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are writen to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.

To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
The cache pool is created with PAGE_SIZE alignment constraint.

Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 2bd36e7b 22-Aug-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: rename the btrfs_calc_*_metadata_size helpers

btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0785a9aa 08-Aug-2019 Qu Wenruo <wqu@suse.com>

btrfs: tree-checker: Add EXTENT_DATA_REF check

EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its
key->offset.

This patch will check the following contents:
- Key->objectid
Basic alignment check.

- Hash
Hash of each extent_data_ref item must match key->offset.

- Offset
Basic alignment check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d3984c90 01-Aug-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: introduce an evict flushing state

We have this weird space flushing loop inside inode.c for evict where
we'll do the normal LIMIT flush, and then commit the transaction and
hope we get our space. This is super janky, and in fact there's really
nothing stopping us from using FLUSH_ALL except that we run delayed
iputs, which means we could deadlock. So introduce a new flush state
for eviction that does the normal priority flushing with all of the
states that are safe for eviction.

The nice side-effect of this is that we'll try harder for evictions.
Previously if (for example generic/269) you had a bunch of other
operations happening on the fs you could race with those reservations
when committing the transaction, and eventually miss getting a
reservation for the evict. With this code we'll have our ticket in
place through the transaction commit, so any pinned bytes will go to our
pending evictions first.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 844245b4 01-Aug-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: add a flush step for delayed iputs

Delayed iputs could very well free up enough space without needing to
commit the transaction, so make this step it's own step. This will
allow us to skip the step for evictions in a later patch.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3e43c279 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group cleanup code

This can now be easily migrated as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh on top of sysfs cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 878d7b67 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the alloc_profile helpers

These feel more at home in block-group.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, adjust btrfs_get_alloc_profile exports ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 07730d87 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the chunk allocation code

This feels more at home in block-group.c than in extent-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>i
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 77745c05 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the dirty bg writeout code

This can be easily migrated over now.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 26ce2095 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate inc/dec_block_group_ro code

This can easily be moved now.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4358d963 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group read/creation code

All of the prep work has been done so we can now cleanly move this chunk
over.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, add btrfs_get_alloc_profile export, comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e3e0520b 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group removal code

This is the removal code and the unused bgs code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, move clear_incompat_bg_bits ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f21246d 06-Aug-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group caching code

We can now just copy it over to block-group.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 67715b20 01-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: cleanup kobject.h includes

The kobject should be pulled in via sysfs.h and that needs to include it
because it needs various definitions like kobj_attribute or kobject.

Signed-off-by: David Sterba <dsterba@suse.com>


# 89439109 01-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: move sysfs declarations out of ctree.h

As the header for sysfs code already exists, use it to clean up ctree.h.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6f410d1b 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: export the excluded extents helpers

We'll need this to move the caching stuff around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3eeb3226 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate nocow and reservation helpers

These are relatively straightforward as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3cad1284 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group ref counting stuff

Another easy set to move over to block-group.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2e405ad8 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the block group lookup code

Move these bits first as they are the easiest to move. Export two of
the helpers so they can be moved all at once.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>


# aac0023c 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move basic block_group definitions to their own header

This is prep work for moving all of the block group cache code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 82253cb6 01-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused key type set/get helpers

The switch to open coded set/get has happend long time ago in
962a298f3511 ("btrfs: kill the key type accessor helpers"), remove the
stray helpers.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 112974d4 19-Jul-2019 Qu Wenruo <wqu@suse.com>

btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate()

[BUG]
Test case btrfs/156 fails since commit 302167c50b32 ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.

[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().

This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate

Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
used space, we can relocate

This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.

btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.

So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.

1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M 129M 257M 385M 513M 550M
|///////| |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)

However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |

2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.

The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.

The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.

However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.

[FIX]
The fix here is to remove btrfs_can_relocate() completely.

If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.

If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 330a5827 17-Jul-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove leftover of in-band dedupe

It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 690a5dbf 05-Jul-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents

When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.

In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.

A test case for fstests follows soon.

Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d7cd4dd9 07-Aug-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix sysfs warning and missing raid sysfs directories

In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:

[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---

These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().

We have this split raid kobject setup since commit 75cb379d263521
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:

1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.

2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.

Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:

1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().

2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.

So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().

Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406d0a9
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).

Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 86736342 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the delalloc space stuff to it's own home

We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb6dea26 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate btrfs_trans_release_chunk_metadata

Move this into transaction.c with the rest of the transaction related
code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6ef03deb 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the delayed refs rsv code

These belong with the delayed refs related code, not in extent-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d12ffdd1 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_block_rsv definitions into it's own header

Prep work for separating out all of the block_rsv related code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2a67a76 18-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: export block_rsv_use_bytes

We are going to need this to move the metadata reservation stuff to
space_info.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc471cb0 18-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: rename do_chunk_alloc to btrfs_chunk_alloc

Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8719aaae 18-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move space_info to space-info.h

Migrate the struct definition and the one helper that's in ctree.h into
space-info.h

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c9d713d5 13-Jun-2019 David Sterba <dsterba@suse.com>

btrfs: improve messages when updating feature flags

Currently the messages printed after setting an incompat feature are
cryptis, we can easily make it better as the textual description is
passed to the helpers. Old:

setting 128 feature flag

updated:

setting incompat feature flag for RAID56 (0x80)

Signed-off-by: David Sterba <dsterba@suse.com>


# 9e967495 22-Apr-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: prevent send failures and crashes due to concurrent relocation

Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:

[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984

Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:

[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---

Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.

To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 71a9c488 31-May-2019 David Sterba <dsterba@suse.com>

btrfs: document BTRFS_MAX_MIRRORS

The real meaning of that constant is not clear from the context due to
the target device inclusion.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6f8e4fd4 30-May-2019 David Sterba <dsterba@suse.com>

btrfs: use file:line format for assertion report

The filename:line format is commonly understood by editors and can be
copy&pasted more easily than the current format.

Signed-off-by: David Sterba <dsterba@suse.com>


# 6d97c6e3 03-Jun-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: add boilerplate code for directly including the crypto framework

Add boilerplate code for directly including the crypto framework. This
helps us flipping the switch for new algorithms.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e25a2e3 22-May-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: don't assume ordered sums to be 4 bytes

BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.

Change the data type to be a byte array and adjust loop index
calculation accordingly.

This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
byte.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 65019df8 22-May-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: resurrect btrfs_crc32c()

Commit 9678c54388b6 ("btrfs: Remove custom crc32c init code") removed
the btrfs_crc32c() function, because it was a duplicate of the crc32c()
library function we already have in the kernel.

Resurrect it as a shim wrapper over crc32c() to make following
transformations of the checksumming code in btrfs easier.

Also provide a btrfs_crc32_final() to ease following transformations.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c8bf1b67 17-May-2019 David Sterba <dsterba@suse.com>

btrfs: remove mapping tree structures indirection

fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9b4e675a 16-May-2019 David Sterba <dsterba@suse.com>

btrfs: detect fast implementation of crc32c on all architectures

Currently, there's only check for fast crc32c implementation on X86,
based on the CPU flags. This is used to decide if checksumming should be
offloaded to worker threads or can be calculated by the caller.

As there are more architectures that implement a faster version of
crc32c (ARM, SPARC, s390, MIPS, PowerPC), also there are specialized hw
cards.

The detection is based on driver name, all generic C implementations
contain 'generic', while the specialized versions do not. Alternatively
the priority could be used, but this is not currently provided by the
crypto API.

The flag is set per-filesystem at mount time and used for the offloading
decisions.

Signed-off-by: David Sterba <dsterba@suse.com>


# 26602cab 10-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

btrfs: use ->free_inode()

a lot of stuff remains in ->destroy_inode()

Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4297ff84 10-Apr-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: track DIO bytes in flight

When diagnosing a slowdown of generic/224 I noticed we were not doing
anything when calling into shrink_delalloc(). This is because all
writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
counter is 0, which short circuits most of the work inside of
shrink_delalloc(). However O_DIRECT writes still consume metadata
resources and generate ordered extents, which we can still wait on.

Fix this by tracking outstanding DIO write bytes, and use this as well
as the delalloc bytes counter to decide if we need to lookup and wait on
any ordered extents. If we have more DIO writes than delalloc bytes
we'll go ahead and wait on any ordered extents regardless of our flush
state as flushing delalloc is likely to not gain us anything.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ use dio instead of odirect in identifiers ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 62d54f3a 22-Apr-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between send and deduplication that lead to failures and crashes

Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:

1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).

2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.

However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).

That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].

The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:

[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27

The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:

[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---

That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).

Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f5c8daa5 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags

Signed-off-by: David Sterba <dsterba@suse.com>


# c71dd880 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter fs_info from btrfs_extend_item

Signed-off-by: David Sterba <dsterba@suse.com>


# 78ac4f9e 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter fs_info from btrfs_truncate_item

Signed-off-by: David Sterba <dsterba@suse.com>


# ffd4bb2a 04-Apr-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
parameter list and the confusing @owner parameter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 82fa113f 04-Apr-2019 Qu Wenruo <wqu@suse.com>

btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()

Use the new btrfs_ref structure and replace parameter list to clean up
the usage of owner and level to distinguish the extent types.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 163e97ee 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from device in btrfs_scrub_cancel_dev

We can read fs_info from the device and can drop it from the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 32b593bf 17-Apr-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: remove no longer used function to run delayed refs asynchronously

It used to be called from only two places (truncate path and releasing a
transaction handle), but commits 28bad2125767c5 ("btrfs: fix truncate
throttling") and db2462a6ad3dc4 ("btrfs: don't run delayed refs in the end
transaction logic") removed their calls to this function, so it's not used
anymore. Just remove it and all its helpers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5742d15f 19-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from trans in btrfs_write_dirty_block_groups

We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# bbebb3e0 19-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from trans in btrfs_setup_space_cache

We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# e74e3993 27-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Factor out in_range macro

This is used in more than one places so let's factor it out in ctree.h.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1c11b63e 27-Mar-2019 Jeff Mahoney <jeffm@suse.com>

btrfs: replace pending/pinned chunks lists with io tree

The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 496245ca 13-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: tree-checker: Verify inode item

There is a report in kernel bugzilla about mismatch file type in dir
item and inode item.

This inspires us to check inode mode in inode item.

This patch will check the following members:

- inode key objectid
Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.

- inode key offset
Should be 0

- inode item generation
- inode item transid
No newer than sb generation + 1.
The +1 is for log tree.

- inode item mode
No unknown bits.
No invalid S_IF* bit.
NOTE: S_IFMT check is not enough, need to check every know type.

- inode item nlink
Dir should have no more link than 1.

- inode item flags

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 90b1377d 27-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: qgroup: remove obsolete fs_info members

The commit fcebe4562dec ("Btrfs: rework qgroup accounting") reworked
qgroups and added some new structures. Another rework of qgroup
mechanics e69bcee37692 ("btrfs: qgroup: Cleanup the old
ref_node-oriented mechanism.") stopped using them and left uncleaned.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e902baac 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in btrfs_leaf_free_space

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# bcdc428c 19-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in btrfs_exclude_logged_extents

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8f881e8c 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from eb in leaf_data_end

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 80fbc341 19-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: Make btrfs_(set|clear)_header_flag return void

From the introduction of btrfs_(set|clear)_header_flag, there is no
usage of its return value. So just make it return void.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afe1a715 07-Mar-2019 Rasmus Villemoes <linux@rasmusvillemoes.dk>

btrfs: implement btrfs_debug* in terms of helper macro

First, the btrfs_debug macros open-code (one possible definition of)
DYNAMIC_DEBUG_BRANCH, so they don't benefit from the CONFIG_JUMP_LABEL
optimization.

Second, a planned change of struct _ddebug (to reduce its size on 64 bit
machines) requires that all descriptors in a translation unit use
distinct identifiers.

Using the new _dynamic_func_call_no_desc helper macro from
dynamic_debug.h takes care of both of these. No functional change.

Link: http://lkml.kernel.org/r/20190212214150.4807-12-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 78c52d9e 06-Feb-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: check for refs on snapshot delete resume

There's a bug in snapshot deletion where we won't update the
drop_progress key if we're in the UPDATE_BACKREF stage. This is a
problem because we could drop refs for blocks we know don't belong to
ours. If we crash or umount at the right time we could experience
messages such as the following when snapshot deletion resumes

BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258 owner 1 offset 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
CPU: 3 PID: 16052 Comm: umount Tainted: G W OE 5.0.0-rc4+ #147
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
FS: 00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
Call Trace:
? _raw_spin_unlock+0x27/0x40
? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
__btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
? join_transaction+0x2b/0x460 [btrfs]
btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
btrfs_commit_transaction+0x52/0xa50 [btrfs]
? start_transaction+0xa6/0x510 [btrfs]
btrfs_sync_fs+0x79/0x1c0 [btrfs]
sync_filesystem+0x70/0x90
generic_shutdown_super+0x27/0x120
kill_anon_super+0x12/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x43/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x8b/0xc0
exit_to_usermode_loop+0xce/0xd0
do_syscall_64+0x20b/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe

To fix this simply mark dead roots we read from disk as DEAD and then
set the walk_control->restarted flag so we know we have a restarted
deletion. From here whenever we try to drop refs for blocks we check to
verify our ref is set on them, and if it is not we skip it. Once we
find a ref that is set we unset walk_control->restarted since the tree
should be in a normal state from then on, and any problems we run into
from there are different issues. I tested this with an existing broken
fs and my reproducer that creates a broken fs and it fixed both file
systems.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0ea82076 12-Feb-2019 David Sterba <dsterba@suse.com>

btrfs: scrub: remove unused nocow worker pointer

The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
optimization was removed from scrub in 9bebe665c3e4 ("btrfs: scrub:
Remove unused copy_nocow_pages and its callchain").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c8352942 12-Feb-2019 David Sterba <dsterba@suse.com>

btrfs: scrub: add assertions for worker pointers

The scrub worker pointers are not NULL iff the scrub is running, so
reset them back once the last reference is dropped. Add assertions to
the initial phase of scrub to verify that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ff09c4ca 29-Jan-2019 Anand Jain <anand.jain@oracle.com>

btrfs: scrub: convert scrub_workers_refcnt to refcount_t

Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
we get the extra checks. All reference changes are still done under
scrub_lock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 450114fc 21-Nov-2018 Josef Bacik <josef@toxicpanda.com>

btrfs: don't use global reserve for chunk allocation

We've done this forever because of the voodoo around knowing how much
space we have. However, we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily. Instead use the actual used amount when
determining if we need to allocate a chunk or not.

This has a side effect for mixed block group fs'es where we are no
longer allocating enough chunks for the data/metadata requirements. To
deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
machine. This will only get used if we've already made a full loop
through the flushing machinery and tried committing the transaction.

If we have then we can try and force a chunk allocation since we likely
need it to make progress. This resolves issues I was seeing with
the mixed bg tests in xfstests without the new flushing state.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 034f784d 03-Dec-2018 Josef Bacik <josef@toxicpanda.com>

btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput. There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue. This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

The waiting is killable as it could be indirectly called from user
operations like fallocate or zero-range. Such call sites should handle
the error but otherwise it's not necessary. Eg. flush_space just needs
to attempt to make space by waiting on iputs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add killable comment and changelog parts ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 2eec5f00 29-Jan-2019 Anders Roxell <anders.roxell@linaro.org>

btrfs: let the assertion expression compile in all configs

A compiler warning (in a patch in development) pointed to a variable
that was used only inside and ASSERT:

u64 root_objectid = root->root_key.objectid;
ASSERT(root_objectid == ...);

fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
u64 root_objectid = root->root_key.objectid;
^~~~~~~~~~~~~

When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.

Rework the assertion helper by adding a runtime check instead of the
'#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
condition being passed into an inline function after preprocessing.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 370a11b8 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Introduce per-root swapped blocks infrastructure

To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.

The designed workflow will be:

1) Record the subtree root block that gets swapped.

During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF

In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.

2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF

3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.

3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).

Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.

4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.

This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d2311e69 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots

Relocation code will drop btrfs_root::reloc_root as soon as
merge_reloc_root() finishes.

However later qgroup code will need to access btrfs_root::reloc_root
after merge_reloc_root() for delayed subtree rescan.

So alter the timming of resetting btrfs_root:::reloc_root, make it
happens after transaction commit.

With this patch, we will introduce a new btrfs_root::state,
BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
that although btrfs_root::reloc_tree is still non-NULL, but still it's
not used any more.

The lifespan of btrfs_root::reloc tree will become:
Old behavior | New
------------------------------------------------------------------------
btrfs_init_reloc_root() --- | btrfs_init_reloc_root() ---
set reloc_root | | set reloc_root |
| | |
| | |
merge_reloc_root() | | merge_reloc_root() |
|- btrfs_update_reloc_root() --- | |- btrfs_update_reloc_root() -+-
clear btrfs_root::reloc_root | set ROOT_DEAD_RELOC_TREE |
| record root into dirty |
| roots rbtree |
| |
| reloc_block_group() Or |
| btrfs_recover_relocation() |
| | After transaction commit |
| |- clean_dirty_subvols() ---
| clear btrfs_root::reloc_root

During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
btrfs_root::reloc_tree should be qgroup.

Since reloc root needs a longer life-span, this patch will also delay
btrfs_drop_snapshot() call.
Now btrfs_drop_snapshot() is called in clean_dirty_subvols().

This patch will increase the size of btrfs_root by 16 bytes.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ab47a8d 12-Dec-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused arguments from btrfs_get_extent_fiemap

This function is a simple wrapper over btrfs_get_extent that returns
either:

a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.

To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc9a8bf7 19-Dec-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Make first argument of btrfs_run_delalloc_range directly an inode

Since this function is no longer a callback there is no need to have
its first argument obfuscated with a void *. Change it directly to a
pointer to an inode. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd340d0f 11-Jan-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: wakeup cleaner thread when adding delayed iput

The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path. Delaying iputs
means we are potentially delaying the eviction of an inode and it's
respective space. The cleaner thread only gets woken up every 30
seconds, or when we require space. If there are a lot of inodes that
need to be deleted we could induce a serious amount of latency while we
wait for these inodes to be evicted. So instead wakeup the cleaner if
it's not already awake to process any new delayed iputs we add to the
list. If we suddenly need space we will less likely be backed up
behind a bunch of inodes that are waiting to be deleted, and we could
possibly free space before we need to get into the flushing logic which
will save us some latency.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 31890da0 21-Nov-2018 Josef Bacik <josef@toxicpanda.com>

btrfs: handle delayed ref head accounting cleanup in abort

We weren't doing any of the accounting cleanup when we aborted
transactions. Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code, this fixes the issue where our
accounting was all wrong after the fs aborts.

The test generic/475 on a 2G VM can trigger the problems eg.:

[ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs]
[ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
[ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014
[ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
[ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
[ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
[ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
[ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
[ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
[ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
[ 8502.170296] FS: 00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
[ 8502.172439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
[ 8502.175094] Call Trace:
[ 8502.175759] close_ctree+0x17f/0x350 [btrfs]
[ 8502.176721] generic_shutdown_super+0x64/0x100
[ 8502.177702] kill_anon_super+0x14/0x30
[ 8502.178607] btrfs_kill_super+0x12/0xa0 [btrfs]
[ 8502.179602] deactivate_locked_super+0x29/0x60
[ 8502.180595] cleanup_mnt+0x3b/0x70
[ 8502.181406] task_work_run+0x98/0xc0
[ 8502.182255] exit_to_usermode_loop+0x83/0x90
[ 8502.183113] do_syscall_64+0x15b/0x180
[ 8502.183919] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Corresponding to

release_global_block_rsv() {
...
WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);

CC: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add log dump ]
Signed-off-by: David Sterba <dsterba@suse.com>


# a65001e8 10-Dec-2018 Al Viro <viro@zeniv.linux.org.uk>

btrfs: sanitize security_mnt_opts use

1) keeping a copy in btrfs_fs_info is completely pointless - we never
use it for anything. Getting rid of that allows for simpler calling
conventions for setup_security_options() (caller is responsible for
freeing mnt_opts in all cases).

2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are*
close (in fact, selinux sb_set_mnt_opts() ought to punt to
sb_remount() in "already initialized" case), but let's handle
that uniformly. And the only reason why the original btrfs changes
didn't go for security_sb_remount() in btrfs_remount() case is that
it hadn't been exported. Let's export it for a while - it'll be
going away soon anyway.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 83354f07 30-Nov-2018 Josef Bacik <jbacik@fb.com>

btrfs: catch cow on deleting snapshots

When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug. This was indeed what was happening, and this patch helped me
verify my theory. It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 64403612 03-Dec-2018 Josef Bacik <josef@toxicpanda.com>

btrfs: rework btrfs_check_space_for_delayed_refs

Now with the delayed_refs_rsv we can now know exactly how much pending
delayed refs space we need. This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill over buffer) and
the delayed refs rsv. If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.

With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 413df725 03-Dec-2018 Josef Bacik <josef@toxicpanda.com>

btrfs: add new flushing states for the delayed refs rsv

A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba2c4d4e 03-Dec-2018 Josef Bacik <jbacik@fb.com>

btrfs: introduce delayed_refs_rsv

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Historical background:

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.

Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 53176dde 04-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: dev-replace: remove custom read/write blocking scheme

After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.

The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held, that has the same blocking effect on writes
as the previous custom scheme that waited until ::blocking_readers was
zero.

Signed-off-by: David Sterba <dsterba@suse.com>


# 129827e3 04-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: dev-replace: swich locking to rw semaphore

This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing that would require more than is provided by
the common locking primitives.

The rw spinlock protects waiting tasks counter in case of incompatible
locks and the waitqueue. Same as rw semaphore.

This patch only switches the locking primitive, for better
bisectability. There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.

Signed-off-by: David Sterba <dsterba@suse.com>


# bbe339cc 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: drop extra enum initialization where using defaults

The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 61fa90c1 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: switch BTRFS_ROOT_* to enums

We can use simple enum for values that are not part of on-disk format:
root tree flags.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb1a524c 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: switch BTRFS_FS_* to enums

We can use simple enum for values that are not part of on-disk format:
internal filesystem states.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 688a75b9 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: switch BTRFS_BLOCK_RSV_* to enums

We can use simple enum for values that are not part of on-disk format:
block reserve types.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# b00146b5 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: switch BTRFS_FS_STATE_* to enums

We can use simple enum for values that are not part of on-disk format:
global filesystem states.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# da12fe54 27-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Refactor btrfs_merge_bio_hook

This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# de37aa51 30-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fsid/metadata_fsid fields from btrfs_info

Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7239ff4b 30-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Introduce support for FSID change without metadata rewrite

This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:

1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.

When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# f8f591df 19-Nov-2018 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: introduce EXPORT_FOR_TESTS macro

Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
functions are either local to the file they are implemented in and thus
should be declared static or are called from within the test
implementation defined in a different file.

Introduce an EXPORT_FOR_TESTS macro which depending on
CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
function or not.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3cd24c69 01-Nov-2018 Ethan Lien <ethanlien@synology.com>

btrfs: use tagged writepage to mitigate livelock of snapshot

Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.

This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.

We do a simple snapshot speed test on a Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched: 6.54sec

This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.

For a multi writers, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched: 10.35sec

The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c629732d 08-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered

This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eede2bf3 03-Nov-2016 Omar Sandoval <osandov@fb.com>

Btrfs: prevent ioctls from interfering with a swap file

A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag

Don't allow those to happen on an active swap file.

Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# abbb55f4 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::split_extent_hook callback

This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c848198 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::merge_extent_hook callback

This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a36bb5f9 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::clear_bit_hook callback

This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e06a1fc9 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::set_bit_hook extent_io callback

This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.

This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7087a9d8 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::writepage_end_io_hook

This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d75855b4 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::writepage_start_hook

This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5eaad97a 01-Nov-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove extent_io_ops::fill_delalloc

This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4222ea71 24-Oct-2018 Filipe Manana <fdmanana@suse.com>

Btrfs: fix deadlock on tree root leaf when finding free extent

When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

schedule+0x28/0x80
btrfs_tree_read_lock+0x8e/0x120 [btrfs]
? finish_wait+0x80/0x80
btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
btrfs_search_slot+0xf6/0x9f0 [btrfs]
? evict_refill_and_join+0xd0/0xd0 [btrfs]
? inode_insert5+0x119/0x190
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_iget+0x113/0x690 [btrfs]
__lookup_free_space_inode+0xd8/0x150 [btrfs]
lookup_free_space_inode+0x5b/0xb0 [btrfs]
load_free_space_cache+0x7c/0x170 [btrfs]
? cache_block_group+0x72/0x3b0 [btrfs]
cache_block_group+0x1b3/0x3b0 [btrfs]
? finish_wait+0x80/0x80
find_free_extent+0x799/0x1010 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
__btrfs_cow_block+0x11d/0x500 [btrfs]
btrfs_cow_block+0xdc/0x180 [btrfs]
btrfs_search_slot+0x3bd/0x9f0 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_update_inode_item+0x46/0x100 [btrfs]
cache_save_setup+0xe4/0x3a0 [btrfs]
btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by using the tree root's commit root when searching for a
block group's free space cache inode item when we are attempting to load
a free space cache. This is safe since block groups once loaded stay in
memory forever, as well as their caches, so after they are first loaded
we will never need to read their inode items again. For new block groups,
once they are created they get their ->cached field set to
BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.

Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42ec3d4c 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: make remap_file_range functions take and return bytes completed

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 2e5dfc99 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: combine the clone and dedupe into a single remap_file_range

Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 7c861627 10-Oct-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: remove fs_info from btrfs_should_throttle_delayed_refs

The avg_delayed_ref_runtime can be referenced from the transaction
handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af9b8a0e 10-Oct-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: remove fs_info from btrfs_check_space_for_delayed_refs

It can be referenced from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 52398340 21-Aug-2018 Liu Bo <bo.liu@linux.alibaba.com>

Btrfs: kill btrfs_clear_path_blocking

Btrfs's btree locking has two modes, spinning mode and blocking mode,
while searching btree, locking is always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locking back to spinning mode by btrfs_clear_path_blocking().

When acquiring locks, both of reader and writer need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.

When switching to blocking mode from spinning mode, it consists of

step 1) bumping up blocking readers counter and
step 2) read_unlock()/write_unlock(),

this has caused serious ping-pong effect if there're a great amount of
concurrent readers/writers, as waiters will be woken up and go to
sleep immediately.

1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/def $MNT
time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
-d $MNT/0 -d $MNT/1 \
-d $MNT/2 -d $MNT/3 \
-d $MNT/4 -d $MNT/5 \
-d $MNT/6 -d $MNT/7 \
-d $MNT/8 -d $MNT/9 \
-d $MNT/10 -d $MNT/11 \
-d $MNT/12 -d $MNT/13 \
-d $MNT/14 -d $MNT/15

w/o patch:
real 2m27.307s
user 0m12.839s
sys 13m42.831s

w/ patch:
real 1m2.273s
user 0m15.802s
sys 8m16.495s

1.1) latency histogram from funclatency[1]

Overall with the patch, there're ~50% less write lock acquisition and
the 95% max latency that write lock takes also reduces to ~100ms from
>500ms.

--------------------------------------------
w/o patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 2385222 |****************************************|
2 -> 3 : 37147 | |
4 -> 7 : 20452 | |
8 -> 15 : 13131 | |
16 -> 31 : 3877 | |
32 -> 63 : 3900 | |
64 -> 127 : 2612 | |
128 -> 255 : 974 | |
256 -> 511 : 165 | |
512 -> 1023 : 13 | |

Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6743860 |****************************************|
2 -> 3 : 2146 | |
4 -> 7 : 190 | |
8 -> 15 : 38 | |
16 -> 31 : 4 | |

--------------------------------------------
w/ patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 1318454 |****************************************|
2 -> 3 : 6800 | |
4 -> 7 : 3664 | |
8 -> 15 : 2145 | |
16 -> 31 : 809 | |
32 -> 63 : 219 | |
64 -> 127 : 10 | |

Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6854317 |****************************************|
2 -> 3 : 2383 | |
4 -> 7 : 601 | |
8 -> 15 : 92 | |

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means callers are
ready to not sleep before releasing the path, but it won't cause
problems if they don't want to sleep in blocking mode.

[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7f8d236a 04-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: dev-replace: move replace members out of fs_info

The replace_wait and bio_counter were mistakenly added to fs_info in
commit c404e0dc2c843b154f ("Btrfs: fix use-after-free in the finishing
procedure of the device replace"), but they logically belong to
fs_info::dev_replace. Besides, bio_counter is a very generic name and is
confusing in bare fs_info context.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3280f874 24-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: remove btrfs_dev_replace::read_locks

This member seems to be copied from the extent_buffer locking scheme and
is at least used to assert that the read lock/unlock is properly nested.
In some way. While the _inc/_dec are called inside the read lock
section, the asserts are both inside and outside, so the ordering is not
guaranteed and we can see read/inc/dec ordered in any way
(theoretically).

A missing call of btrfs_dev_replace_clear_lock_blocking could cause
unexpected read_locks count, so this at least looks like a valid
assertion, but this will become unnecessary with later updates.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b2fa1154 17-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: tests: polish ifdefs around testing helper

Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.

Though there could be no ifdef and unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
any code that would depend on conditions using btrfs_is_testing.

As this is only for the testing code, drop unlikely().

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a654666a 17-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: tests: group declarations of self-test helpers

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 57ec5fb4 17-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: tests: move testing members of struct btrfs_root to the end

The data used only for tests are better placed at the end of the
structure so that they don't change the structure layout. All new
members of btrfs_root should be placed before.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9c36396c 17-Aug-2018 David Sterba <dsterba@suse.com>

btrfs: tests: add separate stub for find_lock_delalloc_range

The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# de2c6615 24-Aug-2018 Liu Bo <bo.liu@linux.alibaba.com>

Btrfs: fix alignment in declaration and prototype of btrfs_get_extent

This fixes btrfs_get_extent to be consistent with our existing
declaration style.

Note: For the record, indentation styles that are accepted are both,
aligning under the opening ( and tab or double tab indentation on the
next line. Preferrably not spliting the type or long expressions in the
argument lists.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 4fd786e6 05-Aug-2018 Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: Remove 'objectid' member from struct btrfs_root

There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.

They are both set to the same value in __setup_root():

static void __setup_root(struct btrfs_root *root,
struct btrfs_fs_info *fs_info,
u64 objectid)
{
...
root->objectid = objectid;
...
root->root_key.objectid = objecitd;
...
}

and not changed to other value after initialization.

grep in btrfs directory shows both are used in many places:
$ grep -rI "root->root_key.objectid" | wc -l
133
$ grep -rI "root->objectid" | wc -l
55
(4.17, inc. some noise)

It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.

Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 684572df 04-Aug-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove root parameter from btrfs_insert_dir_item

All callers pass the root tree of dir, we can push that down to the
function itself.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3a584174 04-Aug-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes

Using true and false here is closer to the expected semantic than using
0 and 1. No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b6fdfbff 23-Aug-2018 Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu

Commit 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.

It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.

Fixes: 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ecebf4d 05-Aug-2018 Robbie Ko <robbieko@synology.com>

Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space

Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.

The steps leading to this problem are:

1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.

2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inode's that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).

3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.

So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:

1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.

2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).

3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.

Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 85c39548 25-Jul-2018 Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: extent-tree: Remove unused __btrfs_free_block_rsv

There is no user of this function anymore.

This was forgotten to be removed in commit a575ceeb1338
("Btrfs: get rid of unused orphan infrastructure").

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6025c19f 31-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove fs_info from btrfs_add_root_ref

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3ee1c553 31-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove fs_info from btrfs_del_root_ref

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab9ce7d4 31-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove fs_info from btrfs_del_root

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ffad70e 20-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: constify strings passed to assertion helper

Signed-off-by: David Sterba <dsterba@suse.com>


# e9539cff 20-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: dev-replace: remove unused members of btrfs_dev_replace

Lock owner and nesting level have been unused since day 1, probably
copy&pasted from the extent_buffer locking scheme without much thinking.
The locking of device replace is simpler and does not need any lock
nesting.

Signed-off-by: David Sterba <dsterba@suse.com>


# e17385ca 20-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: remove unused member btrfs_root::name

Added in 58176a9604c ("Btrfs: Add per-root block accounting and sysfs
entries") in 2007, the roots had names exported in sysfs. The code
was commented out in 4df27c4d5cc1dda54ed ("Btrfs: change how subvolumes
are organized") and cleaned by 182608c8294b5fe9 ("btrfs: remove old
unused commented out code").

Signed-off-by: David Sterba <dsterba@suse.com>


# 5cdc84bf 18-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: drop extent_io_ops::set_range_writeback callback

The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 031f24da 22-May-2018 Qu Wenruo <wqu@suse.com>

btrfs: Use btrfs_mark_bg_unused to replace open code

Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dec59fa3 13-Jul-2018 Ethan Lien <ethanlien@synology.com>

btrfs: use customized batch size for total_bytes_pinned

In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.

[Test]
We tested the patch on a 4-cores x86 machine:

1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg

total_bytes_pinned_percpu: 82 82
0.21 0.61 29.46 0.36 298340
635973 0.09 11.01 173476.25 0.27

patched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg

total_bytes_pinned_percpu: 1 1
0.62 0.62 0.62 0.62 13601
31542 0.14 9.61 11016.90 0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba3c2b19 26-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Add graceful handling of V0 extents

Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.

In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a79865c6 21-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove V0 extent support

The v0 compat code was introduced in commit 5d4f98a28c7d
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e41ca589 06-Jun-2018 Qu Wenruo <wqu@suse.com>

btrfs: Get rid of the confusing btrfs_file_extent_inline_len

We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43a7e99d 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_force_chunk_alloc

It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c83488af 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_inc_block_group_ro

It can be referenced from the passed bg cache.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 61da2abf 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_alloc_logged_file_extent

It can be referenced from trans since the function is always called
within a valid transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 451a2c13 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from check_system_chunk

It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a98ec01 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_remove_block_group

This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e7e02096 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_make_block_group

This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a944442c 12-Jun-2018 Allen Pais <allen.lkml@gmail.com>

btrfs: replace get_seconds with new 64bit time API

The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 87eb5eb2 06-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: dedupe: rationalize args

Clean up f_op->dedupe_file_range() interface.

1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>


# 5740c99e 06-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: dedupe: return int

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a528a241 06-Jun-2018 Souptick Joarder <jrdr.linux@gmail.com>

btrfs: change return type of btrfs_page_mkwrite to vm_fault_t

Use the new return type vm_fault_t for fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Reference commit 1c8f422059ae ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduced inline function in 4.17-rc6.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4c129db 29-May-2018 Gu JinXiang <gujx@cn.fujitsu.com>

btrfs: drop unused parameter qgroup_reserved

Since commit 7775c8184ec0 ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata. So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d1957791 29-May-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove fs_info argument from btrfs_uuid_tree_rem

This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>


# cdb345a8 29-May-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: Remove fs_info argument from btrfs_uuid_tree_add

This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a575ceeb 11-May-2018 Omar Sandoval <osandov@fb.com>

Btrfs: get rid of unused orphan infrastructure

Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7b6a221e 26-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_update_iflags to reflect which flags it touches

The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode are i_flags.

Signed-off-by: David Sterba <dsterba@suse.com>


# 20a68004 27-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Unexport and rename btrfs_invalidate_inodes

This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 110a21fe 26-Feb-2018 David Sterba <dsterba@suse.com>

btrfs: introduce conditional wakeup helpers

Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4457c1c7 10-May-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info argument from add_new_free_space

This function also takes a btrfs_block_group_cache which contains a
referene to the fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3a2f8c07 24-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Unexport btrfs_alloc_delalloc_work

It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 076da91c 23-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove delayed_iput member from btrfs_delalloc_work

When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 76f32e24 23-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes

It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 82b3e53b 23-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots

This parameter was introduced alongside the function in
eb73c1b7cea7 ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da43 ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 008ef096 20-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: drop lock parameter from update_ioctl_balance_args and rename

The parameter controls locking of the stats part but we can lock it
unconditionally, as this only happens once when balance starts. This is
not performance critical.

Add the prefix for an exported function.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3009a62f 20-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: track running balance in a simpler way

Currently fs_info::balance_running is 0 or 1 and does not use the
semantics of atomics. The pause and cancel check for 0, that can happen
only after __btrfs_balance exits for whatever reason.

Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
times but will block on the balance_mutex that protects the
fs_info::flags bit.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dccdb07b 20-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: kill btrfs_fs_info::volume_mutex

Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.

The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.

But we don't really need to track the same information in two ways.

1) After the previous cleanups, the main ioctl handlers for
add/del/resize copy the EXCL_OP bit next to the volume mutex, here
it's clearly safe.

2) Resuming balance during mount or after rw remount will set only the
EXCL_OP bit and the volume_mutex is held in the kernel thread that
calls btrfs_balance.

3) Resuming device replace during mount or after rw remount is done
after balance and is excluded by the EXCL_OP bit. It does not take
the volume_mutex at all and completely relies on the EXCL_OP bit.

4) The resuming of balance and dev-replace cannot hapen at the same time
as the ioctls cannot be started in parallel. Nevertheless, a crafted
image could trigger that and a warning is printed.

5) Balance is normally excluded by EXCL_OP and also uses own mutex to
protect against concurrent access to its status data. There's some
trickery to maintain the right lock nesting in case we need to
reexamine the status in btrfs_ioctl_balance. The volume_mutex is
removed and the unlock/lock sequence is left in place as we might
expect other waiters to proceed.

6) Similar to 5, the unlock/lock sequence is kept in
btrfs_cancel_balance to allow waiters to continue.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 40012f96 19-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove btrfs_wait_and_free_delalloc_work

This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f60a2364 17-Apr-2018 Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy()

Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:

1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes

and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.

Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.

As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ec42f167 17-Apr-2018 Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: Move may_destroy_subvol() from ioctl.c to inode.c

This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>


# c065f5b1 02-Apr-2018 Su Yue <suy.fnst@cn.fujitsu.com>

btrfs: rename btrfs_get_block_group_info and make it static

The function btrfs_get_block_group_info() was introduced by the
commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting
barrier fails") which used it in disk-io.c.

However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.

So, make it static and rename it to be original name
get_block_group_info.

No functional change.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b877331 26-Apr-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Split btrfs_del_delalloc_inode into 2 functions

This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ff6bc37e 20-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Use independent and accurate per inode qgroup rsv

Unlike reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.

Qgroups care more about net change of the extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in COWed tree block, leaving no change in extent usage.
Or causing a leaf split, resulting in one new net extent and increasing
qgroup number by nodesize.
Or in an even more rare case, increase the tree level, increasing qgroup
number by 2 * nodesize.

So here instead of using the complicated calculation for extent
allocator, which cares more about accuracy and no error, qgroup doesn't
need that over-estimated reservation.

This patch will maintain 2 new members in btrfs_block_rsv structure for
qgroup, using much smaller calculation for qgroup rsv, reducing false
EDQUOT.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>


# a514d638 22-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT

Unlike previous method that tries to commit transaction inside
qgroup_reserve(), this time we will try to commit transaction using
fs_info->transaction_kthread to avoid nested transaction and no need to
worry about locking context.

Since it's an asynchronous function call and we won't wait for
transaction commit, unlike previous method, we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of qgroup meta_pertrans
reservation as indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned in transaction committ, it's
useless anyway)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9888c340 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- headers

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8287475a 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space

For quota disabled->enable case, it's possible that at reservation time
quota was not enabled so no bytes were really reserved, while at release
time, quota was enabled so we will try to release some bytes we didn't
really own.

Such situation can cause metadata reserveation underflow, for both types,
also less possible for per-trans type since quota enable will commit
transaction.

To address this, record qgroup meta reserved bytes into
root::qgroup_meta_rsv_pertrans and ::prealloc.
So at releasing time we won't free any bytes we didn't reserve.

For DATA, it's already handled by io_tree, so nothing needs to be done
there.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43b18595 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Use separate meta reservation type for delalloc

Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.

Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.

Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.

And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.

This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.

And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.

This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.

The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.

So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e1211d0e 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup

Since qgroup has seperate metadata reservation types now, we can
completely get rid of the old root->qgroup_meta_rsv, which mostly acts
as current META_PERTRANS reservation type.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4408ea7c 20-Mar-2018 Misono, Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: ctree.h: Fix wrong comment position about csum size

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 75cb379d 20-Mar-2018 Jeff Mahoney <jeffm@suse.com>

btrfs: defer adding raid type kobject until after chunk relocation

Any time the first block group of a new type is created, we add a new
kobject to sysfs to hold the attributes for that type. Kobject-internal
allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
While it appears as if this can occur any time a block group is created,
the only times the first block group of a new type can be created in
memory is at mount and when we create the first new block group during
raid conversion.

This patch adds a new list to track pending kobject additions and then
handles them after we do chunk relocation. Between relocating the
target chunk (or forcing allocation of a new chunk in the case of data)
and removing the old chunk, we're in a safe place for fs-reclaim to
occur. We're holding the volume mutex, which is already held across
page faults, and the delete_unused_bgs_mutex, which will only stall
the cleaner thread.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dc2d3005 20-Mar-2018 Jeff Mahoney <jeffm@suse.com>

btrfs: remove dead create_space_info calls

Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've
separated out the creation and updating of the space info structures.
That commit was a straightforward refactoring of the two parts of
update_space_info, but we can go a step further. Since commits
c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and
b742bb82f (Btrfs: Link block groups of different raid types), we know
that the space_info structures will be created at mount and there will
only ever be, at most, three of them.

This patch cleans out the create_space_info calls after __find_space_info
returns NULL since __find_space_info *can't* return NULL.

The initial cause for reviewing this was the kobject_add calls from
create_space_info occuring in sites where fs-reclaim wasn't allowed. Now
we are certain they occur only early in the mount process and are safe.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5ead2dd0 15-Mar-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Drop fs_info parameter from btrfs_finish_extent_commit

It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c79a70b1 15-Mar-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: drop fs_info parameter from btrfs_run_delayed_refs

It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 92e2f7e3 05-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove btrfs_fs_info::open_ioctl_trans

Since userspace transaction have been removed we no longer have use
for this field so delete it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 859e682d 05-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove btrfs_file_private::trans

Now that the userspace transaction IOCTL have been removed, this member
is no longer used so just remove it

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7a5a07a8 05-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove userspace transaction ioctls

Commit 3558d4f88ec8 ("btrfs: Deprecate userspace transaction ioctls")
marked the beginning of the end of userspace transaction. This commit
finishes the job! There are no known users and ceph does not use the
ioctl anymore.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Sage Weil <sage@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7c829b72 07-Mar-2018 Anand Jain <anand.jain@oracle.com>

btrfs: add define for oldest generation

Some functions can filter metadata by the generation. Add a define that
will annotate such arguments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 738c93d4 27-Feb-2018 David Sterba <dsterba@suse.com>

btrfs: move btrfs_listxattr prototype to xattr.h

There's a proper header for xattr handlers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d612ac59 26-Feb-2018 Anand Jain <anand.jain@oracle.com>

btrfs: unify types for metadata_ratio and data_chunk_allocations

We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e67c718b 19-Feb-2018 David Sterba <dsterba@suse.com>

btrfs: add more __cold annotations

The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>


# 9678c543 08-Jan-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove custom crc32c init code

The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.

Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:

1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.

I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.

Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.

The modinfo confirms that now all the module dependencies are there:

before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 3e72ee88 30-Jan-2018 Qu Wenruo <wqu@suse.com>

btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.

Although the name is a little longer, but it should explain its usage
quite well.

2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.

This also makes later code re-use between kernel and btrfs-progs
easier.

3) Remove function get_block_group_index()
Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5d23515b 31-Jan-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable

Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.

This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d3740608 13-Feb-2018 Anand Jain <anand.jain@oracle.com>

btrfs: manage commit mount option as %u

As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7b885be 13-Feb-2018 Anand Jain <anand.jain@oracle.com>

btrfs: manage thread_pool mount option as %u

The mount option thread_pool is always unsigned. Manage it that way all
around.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 21217054 07-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Don't pass fs_info arg to btrfs_start_dirty_block_groups

It can be referenced from the passed transaction so no point in passing
it as a function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6c686b35 07-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info argument from btrfs_create_pending_block_groups

It can be referenced from the passed transaciton so no point in
passing it as function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0e34693f 07-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_trans_release_metadata private to transaction.c

This function is only ever used in __btrfs_end_transaction and
btrfs_commit_transaction so there is no need to export it via header.
Let's move it closer to where it's used, make it static and remove it
from the header. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1f250e92 28-Feb-2018 Filipe Manana <fdmanana@suse.com>

Btrfs: fix log replay failure after unlink and link combination

If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.

Example:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt

$ mkdir /mnt/testdir
$ touch /mnt/testdir/foo
$ ln /mnt/testdir/foo /mnt/testdir/bar

$ sync

$ unlink /mnt/testdir/bar
$ touch /mnt/testdir/bar
$ xfs_io -c "fsync" /mnt/testdir/bar

<power failure>

$ mount /dev/sdb /mnt
mount: mount(2) failed: /mnt: No such file or directory

When replaying the log, for that example, we also see the following in
dmesg/syslog:

[71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
[71813.674204] ------------[ cut here ]------------
[71813.675694] BTRFS: Transaction aborted (error -2)
[71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
[71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
[71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
[71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
[71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
[71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
[71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
[71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
[71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
[71813.679669] Call Trace:
[71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
[71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
[71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
[71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
[71813.679669] ? rcu_read_unlock+0x3a/0x57
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
[71813.679669] walk_log_tree+0xad/0x188 [btrfs]
[71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
[71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
[71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
[71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] do_mount+0x6e5/0x973
[71813.679669] ? memdup_user+0x3e/0x5c
[71813.679669] SyS_mount+0x72/0x98
[71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
[71813.679669] RIP: 0033:0x7f7cedf150ba
[71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
[71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
[71813.679669] ---[ end trace 83bd473fc5b4663b ]---
[71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
[71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
[71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
[71814.128078] BTRFS error (device dm-0): open_ctree failed

This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.

Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a8fd1f71 15-Feb-2018 Jeff Mahoney <jeffm@suse.com>

btrfs: use kvzalloc to allocate btrfs_fs_info

The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.

There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.

As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c04e61b5 05-Jan-2018 Liu Bo <bo.li.liu@oracle.com>

Btrfs: move extent map specific code to extent_map.c

These helpers are extent map specific, move them to extent_map.c.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7b4df058 05-Jan-2018 Liu Bo <bo.li.liu@oracle.com>

Btrfs: add helper for em merge logic

This is a prepare work for the following extent map selftest, which
runs tests against em merge logic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 203e02d9 22-Dec-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: remove unused wait in btrfs_stripe_hash

In fact nobody is waiting on @wait's waitqueue, it can be safely
removed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bae15d95 07-Nov-2017 Qu Wenruo <wqu@suse.com>

btrfs: Cleanup existing name_len checks

Since tree-checker has verified leaf when reading from disk, we don't
need the existing verify_dir_item() or btrfs_is_name_len_valid() checks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f5c29bd9 02-Nov-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: add __init macro to btrfs init functions

Adding __init macro gives kernel a hint that this function is only used
during the initialization phase and its memory resources can be freed up
after.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"

SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e3b8a485 03-Nov-2017 Filipe Manana <fdmanana@suse.com>

Btrfs: fix reported number of inode blocks after buffered append writes

The patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:

$ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
$ du -h foobar
0 foobar
$ sync
$ du -h foobar
64K foobar

The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.

This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().

This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:

$ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt

Fixes: a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 69fe2d75 19-Oct-2017 Josef Bacik <josef@toxicpanda.com>

btrfs: make the delalloc block rsv per inode

The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years. There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.

Fix this by making the delalloc block rsv per-inode. This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8b62f87b 19-Oct-2017 Josef Bacik <josef@toxicpanda.com>

Btrfs: rework outstanding_extents

Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent. This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits. This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.

Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself. This
means we'll have something like this

btrfs_delalloc_reserve_metadata - outstanding_extents = 1
btrfs_set_extent_delalloc - outstanding_extents = 2
btrfs_release_delalloc_extents - outstanding_extents = 1

for an initial file write. Now take the append write where we extend an
existing delalloc range but still under the maximum extent size

btrfs_delalloc_reserve_metadata - outstanding_extents = 2
btrfs_set_extent_delalloc
btrfs_set_bit_hook - outstanding_extents = 3
btrfs_merge_extent_hook - outstanding_extents = 2
btrfs_delalloc_release_extents - outstanding_extnets = 1

In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with

btrfs_add_ordered_extent - outstanding_extents = 2
clear_extent_bit - outstanding_extents = 1
btrfs_remove_ordered_extent - outstanding_extents = 0

This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.

The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be. This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write. This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost. I have another change coming to mitigate this side-effect
somewhat.

I also added trace points for the counter manipulation. These were used
by a bpf script I wrote to help track down leak issues.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f51d2b59 15-Sep-2017 David Sterba <dsterba@suse.com>

btrfs: allow to set compression level for zlib

Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compess=zlib # default
$ mount -o compess=zlib0 # same
$ mount -o compess=zlib9 # level 9, slower sync, less data
$ mount -o compess=zlib1 # level 1, faster sync, more data
$ mount -o remount,compress=zlib3 # level set by remount

The compress-force works the same as compress'. The level is visible in
the same format in /proc/mounts. Level set via file property does not
work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba <dsterba@suse.com>


# d4417e22 16-Oct-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Replace opencoded sizes with their symbolic constants

Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unifiy the code base to always use the symbolic constants. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fd708b81 29-Sep-2017 Josef Bacik <josef@toxicpanda.com>

Btrfs: add a extent ref verify tool

We were having corruption issues that were tied back to problems with
the extent tree. In order to track them down I built this tool to try
and find the culprit, which was pretty successful. If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid. I've run this through with
xfstests and haven't gotten any false positives. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 84f7d8e6 29-Sep-2017 Josef Bacik <josef@toxicpanda.com>

btrfs: pass root to various extent ref mod functions

We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead. This will be used in
a subsequent patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb592373 29-Sep-2017 Josef Bacik <josef@toxicpanda.com>

btrfs: add ref-verify mount option

This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhnance btrfs_print_mod_info to print if ref-verify is compiled in ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 736cd52e 07-Sep-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: remove nr_async_submits and async_submit_draining

Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.

Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.

However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f851689b 07-Sep-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: remove nr_async_bios

This was intended to congest higher layers to not send bios, but as

1) the congested bit has been taken by writeback

Async bios come from buffered writes and DIO writes.

For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.

2) and no one is waiting for %nr_async_bios down to zero,

Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus. And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.

We can safely remove them now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ee8c494f 20-Aug-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused arguments from btrfs_changed_cb_t

btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really uses the first
2 parameters, i.e. the roots. Since there are not other functions
implementing the btrfs_changed_cb_t let's remove the unused parameters
from the prototype and implementation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 69ad5976 03-Oct-2017 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: fix overlap of fs_info::flags values

Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.

Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>


# c2faff79 30-Aug-2017 Misono, Tomohiro <misono.tomohiro@jp.fujitsu.com>

btrfs: remove BTRFS_FS_QUOTA_DISABLING flag

Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.

When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.

BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.

So, let's remove BTRFS_FS_QUOTA_DISABLING flag.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cd5447e 17-Aug-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: pass fs_info to btrfs_del_root instead of tree_root

btrfs_del_roots always uses the tree_root. Let's pass fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4335958d 18-Aug-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: remove BUG() in btrfs_extent_inline_ref_size

Now that btrfs_get_extent_inline_ref_type() can report if type is a
valid one and all callers can gracefully deal with that, we don't need
to crash here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 167ce953 18-Aug-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: add a helper to retrive extent inline ref type

An invalid value of extent inline ref type may be read from a
malicious image which may force btrfs to crash.

This adds a helper which does sanity check for the ref type, so we can
know if it's sane, return he type, otherwise return an error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimal tweak const types, causing warnings due to other cleanup patches ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 0174484d 27-Jul-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove chunk_objectid argument from btrfs_make_block_group

btrfs_make_block_group is always called with chunk_objectid set to
BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
change anytime soon, so let's remove the argument and decrease the cognitive
load when reading the code path. No functional change

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 583b7231 28-Jul-2017 Hans van Kranenburg <hans.van.kranenburg@mendix.com>

btrfs: Do not use data_alloc_cluster in ssd mode

This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

The SSD story...

In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

However, there's a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

Changes made in the code

1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.

Backporting notes

Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 23b5ec74 24-Jul-2017 Josef Bacik <jbacik@fb.com>

btrfs: fix readdir deadlock with pagefault

Readdir does dir_emit while under the btree lock. dir_emit can trigger
the page fault which means we can deadlock. Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.

Thread A
readdir <holding tree lock>
dir_emit
<page fault>
down_read(mmap_sem)

Thread B
mmap write
down_write(mmap_sem)
page_mkwrite
wait_ordered_extents

Process C
finish_ordered_extent
insert_reserved_file_extent
try to lock leaf <hang>

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# d3c0bab5 21-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: remove trivial wrapper btrfs_force_ra

It's a simple call page_cache_sync_readahead, same arguments in the same
order.

Signed-off-by: David Sterba <dsterba@suse.com>


# 35dc3130 21-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: drop ancient page flag mappings

There's no PageFsMisc. Added by patch 4881ee5a2e995 in 2008, the flag is
not present in current kernels.

Signed-off-by: David Sterba <dsterba@suse.com>


# ea14b57f 21-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: fix spelling of snapshotting

Signed-off-by: David Sterba <dsterba@suse.com>


# 19aee8de 18-Jul-2017 Anand Jain <Anand.Jain@oracle.com>

btrfs: btrfs_inherit_iflags() can be static

btrfs_new_inode() is the only consumer move it to inode.c,
from ioctl.c.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc3cce23 08-Mar-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Cleanup num_tolerated_disk_barrier_failures

As we use per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.

We can now remove it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cbb1f45 28-Jun-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: struct-funcs, constify readers

We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.

No impact on code, but serves as documentation that a buffer is intended
not to be modified.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 23d1f737 28-Jun-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: remove unused sectorsize member

The sectorsize member of btrfs_block_group_cache is unused. So remove it, this
reduces the number of holes in the struct.

With patch:
/* size: 856, cachelines: 14, members: 40 */
/* sum members: 837, holes: 4, sum holes: 19 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 24 bytes */

Without patch:
/* size: 864, cachelines: 14, members: 41 */
/* sum members: 841, holes: 5, sum holes: 23 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 32 bytes */

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c1aab1d 09-Aug-2017 Nick Terrell <terrelln@fb.com>

btrfs: Add zstd support

Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method | Ratio | Compression MB/s | Decompression speed |
|---------|-------|------------------|---------------------|
| None | 0.99 | 504 | 686 |
| lzo | 1.66 | 398 | 442 |
| zlib | 2.58 | 65 | 241 |
| zstd 1 | 2.57 | 260 | 383 |
| zstd 3 | 2.71 | 174 | 408 |
| zstd 6 | 2.87 | 70 | 398 |
| zstd 9 | 2.92 | 43 | 406 |
| zstd 12 | 2.93 | 21 | 408 |
| zstd 15 | 3.01 | 11 | 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 |
| lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 |
| zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 |
| zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# bc42bda2 27-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges

[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

Task A | Task B
----------------------------------------------------------------------
Buffered_write [0, 2K) |
|- check_data_free_space() |
| |- qgroup_reserve_data() |
| Range aligned to page |
| range [0, 4K) <<< |
| 4K bytes reserved <<< |
|- copy pages to page cache |
| Buffered_write [2K, 4K)
| |- check_data_free_space()
| | |- qgroup_reserved_data()
| | Range alinged to page
| | range [0, 4K)
| | Already reserved by A <<<
| | 0 bytes reserved <<<
| |- delalloc_reserve_metadata()
| | And it *FAILED* (Maybe EQUOTA)
| |- free_reserved_data_space()
|- qgroup_free_data()
Range aligned to page range
[0, 4K)
Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 364ecf36 27-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e79a3327 06-Jun-2017 Su Yue <suy.fnst@cn.fujitsu.com>

btrfs: Check name_len with boundary in verify dir_item

Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.

Example:
1. Corrupt one dir_item name_len to be 255.
2. Run 'ls -lar /mnt/test/ > /dev/null'
dmesg:
[ 48.451449] BTRFS info (device vdb1): disk space caching is enabled
[ 48.451453] BTRFS info (device vdb1): has skinny extents
[ 48.489420] general protection fault: 0000 [#1] SMP
[ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
[ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
[ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
[ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
[ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
[ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
[ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
[ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
[ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
[ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
[ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
[ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
[ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.490008] Call Trace:
[ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
[ 48.490008] iterate_dir+0x181/0x1b0
[ 48.490008] SyS_getdents+0xa7/0x150
[ 48.490008] ? fillonedir+0x150/0x150
[ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad
[ 48.490008] RIP: 0033:0x7fef5032546b
[ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
[ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
[ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
[ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
[ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
[ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
[ 48.499455] ---[ end trace 321920d8e8339505 ]---

Fix it by adding a parameter @slot and check name_len with item boundary
by calling btrfs_is_name_len_valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
rev
Signed-off-by: David Sterba <dsterba@suse.com>


# 19c6dcbf 06-Jun-2017 Su Yue <suy.fnst@cn.fujitsu.com>

btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary

Introduce function btrfs_is_name_len_valid.

The function compares parameter @name_len with item boundary then
returns true if name_len is valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 7dfb8be1 16-Jun-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Round down values which are written for total_bytes_size

We got an internal report about a file system not wanting to mount
following 99e3ecfcb9f4 ("Btrfs: add more validation checks for
superblock").

BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
fs_devices total_rw_bytes 1000203820544

Subtracting the numbers we get a difference of less than a 4kb. Upon
closer inspection it became apparent that mkfs actually rounds down the
size of the device to a multiple of sector size. However, the same
cannot be said for various functions which modify the total size and are
called from btrfs_balance as well as when adding a new device. So this
patch ensures that values being saved into on-disk data structures are
always rounded down to a multiple of sectorsize.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eca152ed 16-Jun-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Manually implement device_total_bytes getter/setter

The device->total_bytes member needs to always be rounded down to sectorsize
so that it corresponds to the value of super->total_bytes. However, there are
multiple places where the setter is fed a value which is not rounded which
can cause a fs to be unmountable due to the check introduced in
99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). This patch
implements the getter/setter manually so that in a later patch I can add
necessary code to catch offenders.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0d0c71b3 14-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: obsolete and remove mount option alloc_start

The mount option alloc_start was used in the past for debugging and
stressing the chunk allocator. Not meant to be used by users, so we're
not breaking anybody's setup.

There was some added complexity handling changes of the value and when
it was not same as default. Such code has likely been untested and I
think it's better to remove it.

This patch kills all use of alloc_start, and by doing that also fixes
a bug when alloc_size is set, potentially called from statfs:

in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
protection is temporarily dropped so btrfs_account_dev_extents_size can
be called and then RCU is locked again! Doing that inside
list_for_each_entry_rcu is just asking for trouble, but unlikely to be
observed in practice.

Signed-off-by: David Sterba <dsterba@suse.com>


# fac03c8d 15-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: move fs_info::fs_frozen to the flags

We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4b5faeac 27-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: use generic slab for for btrfs_transaction

Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the nubmer goes down to
1. This matches the expectations of the transaction lifetime.

For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.

We'll lose the stats about the object use, which could be added later if
we really need them.

Signed-off-by: David Sterba <dsterba@suse.com>


# 118c701e 22-May-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: remove __BTRFS_LEAF_DATA_SIZE

__BTRFS_LAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the
latter subsume the former.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d9ec8c4 29-May-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET

Commit 5f39d397dfbe ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 1b86826d 17-May-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: cleanup root usage by btrfs_get_alloc_profile

There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.

This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases. The next patch will eliminate one of those.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c6100a4b 05-May-2017 Josef Bacik <josef@toxicpanda.com>

Btrfs: replace tree->mapping with tree->private_data

For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>


# f29efe29 11-May-2017 Sargun Dhillon <sargun@sargun.me>

btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE

This patch introduces the quota override flag to btrfs_fs_info, and a
change to quota limit checking code to temporarily allow for quota to be
overridden for processes with CAP_SYS_RESOURCE.

It's useful for administrative programs, such as log rotation, that may
need to temporarily use more disk space in order to free up a greater
amount of overall disk space without yielding more disk space to the
rest of userland.

Eventually, we may want to add the idea of an operator-specific quota,
operator reserved space, or something else to allow for administrative
override, but this is perhaps the simplest solution.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>


# a5ed45f8 11-May-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Convert fs_info->free_chunk_space to atomic64_t

The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else. Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 70e7af24 02-Jun-2017 Omar Sandoval <osandov@fb.com>

Btrfs: fix delalloc accounting leak caused by u32 overflow

btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().

In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().

The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:

WARN_ON(fs_info->delalloc_block_rsv.size > 0);
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
space_info->bytes_may_use > 0);

Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.

Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4e4cbee9 03-Jun-2017 Christoph Hellwig <hch@lst.de>

block: switch bios to blk_status_t

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9e11ceee 11-Apr-2017 Jan Kara <jack@suse.cz>

btrfs: Convert to separately allocated bdi

Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0966a7b1 13-Apr-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: scrub: Introduce full stripe lock for RAID56

Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't do proper protection, it can easily cause race condition.

Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6.
Which store a rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe to avoid race.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 171938e5 28-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: track exclusive filesystem operation in flags

There are several operations, usually started from ioctls, that cannot
run concurrently. The status is tracked in
mutually_exclusive_operation_running as an atomic_t. We can easily track
the status as one of the per-filesystem flag bits with same
synchronization guarantees.

The conversion replaces:

* atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...)
* atomic_set(..., 0) -> clear_bit(FLAG, ...)

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# de47c9d3 16-Mar-2017 Edmund Nadolski <enadolski@suse.com>

btrfs: replace hardcoded value with SEQ_LAST macro

Define the SEQ_LAST macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d48d71aa 02-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: remove redundant parameter from btree_readahead_hook

We can read fs_info from eb.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0700cea7 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert btrfs_root.refs from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e4f4714 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ce0dcee6 14-Mar-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: Change qgroup_meta_rsv to 64bit

Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.

This affects exclusive qgroups only.

TEST CASE:

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp

umount $SUBVOL
umount $MOUNTPOINT

mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL

btrfs quota rescan /mnt -w

for i in `seq 1 44000`; do
dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
if [[ $? > 0 ]]; then
btrfs qgroup show -pcref $SUBVOL
exit 1
fi
done

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# db0a669f 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_add_link take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fc4f21b1 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make get_extent_t take btrfs_inode

In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9cdc5124 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 73f2e545 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_orphan_add take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7a6d7067 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_mark_extent_written take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dcdbc059 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_drop_extent_cache take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6158e1ce 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 691fa059 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: all btrfs_delalloc_release_metadata take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f3db423 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 703b391a 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_orphan_release_metadata take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8ed7a2a0 20-Feb-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 04f4f916 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 70ddc553 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_is_free_space_inode take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e7611cf 20-Feb-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_insert_dir_item take btrfs_inode

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 11bac800 24-Feb-2017 Dave Jiang <dave.jiang@intel.com>

mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf

->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71367b3f 15-Feb-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: use btrfs_debug instead of pr_debug in transaction abort

Commit e5d6b12fe14 (Btrfs: don't WARN() in btrfs_transaction_abort() for
IO errors) added a pr_debug call to be printed when a transaction is
aborted with -EIO instead of WARN. btrfs_debug prints which file system
the message is associated with so let's use that instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5e00f193 15-Feb-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: convert btrfs_inc_block_group_ro to accept fs_info

btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree. Let's convert
to passing an fs_info and using the extent root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8b74c03e 10-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter from btrfs_prepare_extent_commit

Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7775c818 10-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused parameter from btrfs_subvolume_release_metadata

Unused since qgroup refactoring that split data and metadata accounting,
the btrfs_qgroup_free helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e4c3b2dc 30-Jan-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist

run_delalloc_nocow has used trans in two places where they don't
actually need @trans.

For btrfs_lookup_file_extent, we search for file extents without COWing
anything, and for btrfs_cross_ref_exist, the only place where we need
@trans is deferencing it in order to get running_transaction which we
could easily get from the global fs_info.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 310712b2 18-Jan-2017 Omar Sandoval <osandov@fb.com>

Btrfs: constify struct btrfs_{,disk_}key wherever possible

In a lot of places, it's unclear when it's safe to reuse a struct
btrfs_key after it has been passed to a helper function. Constify these
arguments wherever possible to make it obvious.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ec5934e 17-Jan-2017 Nikolay Borisov <n.borisov.lkml@gmail.com>

btrfs: Make btrfs_unlink_inode take btrfs_inode

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 823bb20a 04-Jan-2017 David Sterba <dsterba@suse.com>

btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE

The expression is open-coded in several places, this asks for a wrapper.
As we know the MAX_EXTENT fits to u32, we can use the appropirate
division helper. This cascades to the result type updates.

Compiler is clever enough to use shift instead of integer division, so
there's no change in the generated assembly.

Signed-off-by: David Sterba <dsterba@suse.com>


# a76b5b04 09-Dec-2016 Christoph Hellwig <hch@lst.de>

fs: try to clone files first in vfs_copy_file_range

A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way. Instead of duplicating
this logic move it to the VFS. Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch. NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# e5d6b12f 09-Dec-2016 Chris Mason <clm@fb.com>

Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors

btrfs_transaction_abort() has a WARN() to help us nail down whatever
problem lead to the abort. But most of the time, we're aborting for EIO,
and the warning just adds noise.

Signed-off-by: Chris Mason <clm@fb.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0b246afa 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, add fs_info convenience variables

In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 27965b6c 16-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da17066c 15-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: pull node/sector/stripe sizes out of root and into fs_info

We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f15376df 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, io_ctl_init

The io_ctl->root member was only being used to access root->fs_info.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c28f158e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: struct reada_control.root -> reada_control.fs_info

The root is never used. We substitute extent_root in for the
reada_find_extent call, since it's only ever used to obtain the node
size. This call site will be changed to use fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6bccf3ab 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: call functions that always use the same root with fs_info instead

There are many functions that are always called with the same root
argument. Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5b4aacef 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: call functions that overwrite their root parameter with fs_info

There are 11 functions that accept a root parameter and immediately
overwrite it. We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed0df618 01-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: store and load values of stripes_min/stripes_max in balance status item

The balance status item contains currently known filter values, but the
stripes filter was unintentionally not among them. This would mean, that
interrupted and automatically restarted balance does not apply the
stripe filters.

Fixes: dee32d0ac3719ef8d640efaf0884111df444730f
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>


# 2230adff 08-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: delete unused member from superblock

__bdev' has never been used since
0b86a832a1f38abec695864ec2eaedc9d2383f1b (2008).

Signed-off-by: David Sterba <dsterba@suse.com>


# fc2e901f 08-Nov-2016 David Sterba <dsterba@suse.com>

btrfs: reada, sink start parameter to btree_readahead_hook

Originally, the eb and start were passed separately in case eb is NULL.
Since the readahead has been refactored in 4.6, this is not true anymore
and we can get rid of the parameter.

Signed-off-by: David Sterba <dsterba@suse.com>


# 939659df 07-Nov-2016 Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

btrfs: add necessary comments about tickets_id

Tickets_id's name may result in some misunderstandings, it just indicates
the next ticket will be handled and is not stored per ticket.

Fixes: ce12965 ("btrfs: introduce tickets_id to determine whether
asynchronous metadata reclaim work makes progress")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cf8cddd3 27-Oct-2016 Christoph Hellwig <hch@lst.de>

btrfs: don't abuse REQ_OP_* flags for btrfs_map_block

btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6675df31 22-Sep-2016 Omar Sandoval <osandov@fb.com>

Btrfs: catch invalid free space trees

There are two separate issues that can lead to corrupted free space
trees.

1. The free space tree bitmaps had an endianness issue on big-endian
systems which is fixed by an earlier patch in this series.
2. btrfs-progs before v4.7.3 modified filesystems without updating the
free space tree.

To catch both of these issues at once, we need to force the free space
tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit.
If the bit isn't set, we know that it was either produced by a broken
big-endian kernel or may have been corrupted by btrfs-progs.

This also provides us with a way to add rudimentary read-write support
for the free space tree to btrfs-progs: it can just clear this bit and
have the kernel rebuild the free space tree.

Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2fd57fcb 23-Sep-2016 Arnd Bergmann <arnd@arndb.de>

btrfs: fix btrfs_no_printk stub helper

The addition of btrfs_no_printk() caused a build failure when
CONFIG_PRINTK is disabled:

fs/btrfs/send.c: In function 'send_rename':
fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration]

This moves the helper outside of that #ifdef so it is always
defined, and changes the existing #ifdef to refer to that
helper as well for consistency.

Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 851cd173 23-Sep-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: memset to avoid stale content in btree leaf

This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".

This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c01f5f96 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_debug should consume fs_info when DEBUG is not defined

We can hit unused variable warnings when btrfs_debug and friends are
just aliases for no_printk. This is due to the fs_info not getting
consumed by the function call, which can happen if convenenience
variables are used. This patch adds a new btrfs_no_printk static inline
that consumes the convenience variable and does nothing else. It
silences the unused variable warning and has no impact on the generated
code:

$ size fs/btrfs/extent_io.o*
text data bss dec hex filename
44072 152 32 44256 ace0 fs/btrfs/extent_io.o.btrfs_no_printk
44072 152 32 44256 ace0 fs/btrfs/extent_io.o.no_printk

Fixes: 27a0dd61a5 (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab8d0fc4 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert pr_* to btrfs_* where possible

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 62e85577 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert printk(KERN_* to use pr_* calls

This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# afcdd129 02-Sep-2016 Josef Bacik <jbacik@fb.com>

Btrfs: add a flags field to btrfs_fs_info

We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba8b04c1 19-Jul-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset

Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.

This should reduce conflict of both patchset and the effort to rebase
them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 897a41b1 31-Aug-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: add dynamic debug support

We can re-use the dynamic debugging descriptor to make use of the dynamic
debugging mechanism but still use our own printk interface.

Defining the DEBUG macro works as it did before. When it's defined,
all of the messages default to print. We can also enable all debug
messages at boot or module-load time using the 'dyndbg' and
'btrfs.dyndbg' options.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f0312210 15-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

btrfs: use filemap_check_errors()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Chris Mason <clm@fb.com>


# ce129655 01-Sep-2016 Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress

In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.

ticket = list_first_entry(&space_info->tickets,
struct reserve_ticket, list);
if (last_ticket == ticket) {
flush_state++;
} else {
last_ticket = ticket;
flush_state = FLUSH_DELAYED_ITEMS_NR;
if (commit_cycles)
commit_cycles--;
}

But indeed it's wrong, we should not rely on local variable's address to
do this check, because addresses may be same. In my test environment, I
dd one 168MB file in a 256MB fs, found that for this file, every time
wait_reserve_ticket() called, local variable ticket's address is same,

For above codes, assume a previous ticket's address is addrA, last_ticket
is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
wake up it, then another ticket is added, but with the same address addrA,
now last_ticket will be same to current ticket, then current ticket's flush
work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
which may result in some enospc issues(I have seen this in my test machine).

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9e7cc91a 31-Jul-2016 Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

btrfs: fix fsfreeze hang caused by delayed iputs deal

When running fstests generic/068, sometimes we got below deadlock:
xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080
ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
Call Trace:
[<ffffffff816a9045>] schedule+0x35/0x80
[<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
[<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
[<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
[<ffffffff810d32b5>] percpu_down_read+0x35/0x50
[<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
[<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
[<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
[<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
[<ffffffff81230a1a>] evict+0xba/0x1a0
[<ffffffff812316b6>] iput+0x196/0x200
[<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
[<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
[<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
[<ffffffff81218040>] freeze_super+0xf0/0x190
[<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
[<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
[<ffffffff81229409>] SyS_ioctl+0x79/0x90
[<ffffffff81003c12>] do_syscall_64+0x62/0x110
[<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25

>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

CPU1 | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb); |
| |-> cleaner_kthread()
| | |-> btrfs_delete_unused_bgs()
| | |-> btrfs_remove_chunk()
| | |-> btrfs_remove_block_group()
| | |-> btrfs_add_delayed_iput()
| |
|-> sb->s_writers.frozen = SB_FREEZE_FS; |
|-> sb_wait_write(sb, SB_FREEZE_FS); |
| acquire SB_FREEZE_FS lock. |
| |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs() |
| will handle delayed iputs, |
| that means start_transaction() |
| will be called, which will try |
| to get SB_FREEZE_FS lock. |

To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to btrfs_freeze ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>


# 18513091 25-Jul-2016 Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

btrfs: update btrfs_space_info's bytes_may_use timely

This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.

Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d2c609b8 14-Aug-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: properly track when rescan worker is running

The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused. If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed. When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.

This patch uses a separate flag to indicate when the worker is
running. The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by; Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>


# 66642832 10-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_abort_transaction, drop root parameter

__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer. We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1db1ff92 15-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert nodesize macros to static inlines

This patch converts the macros used to calculate various node
size limits to static inlines. That way we get type checking for free.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 14a1e067 15-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: introduce BTRFS_MAX_ITEM_SIZE

We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in
several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the
same.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0c83b62e 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: cleanup, remove prototype for btrfs_find_root_ref

The function isn't implemented anywhere.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f5ee5c9a 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root

Now that we have a dummy fs_info associated with each test that
uses a root, we don't need the DUMMY_ROOT bit anymore. This lets
us make choices without needing an actual root like in e.g.
btrfs_find_create_tree_block.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7c0260ee 20-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: tests, require fs_info for root

This allows the upcoming patchset to push nodesize and sectorsize into
fs_info.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3cdde224 09-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_test_opt and friends should take a btrfs_fs_info

btrfs_test_opt and friends only use the root pointer to access
the fs_info. Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 05653ef3 15-Jul-2016 David Sterba <dsterba@suse.com>

btrfs: hide test-only member under ifdef

Signed-off-by: David Sterba <dsterba@suse.com>


# f376df2b 25-Mar-2016 Josef Bacik <jbacik@fb.com>

Btrfs: add tracepoints for flush events

We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 957780eb 17-May-2016 Josef Bacik <jbacik@fb.com>

Btrfs: introduce ticketed enospc infrastructure

Our enospc flushing sucks. It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out. So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation. This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations. Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread. This async flusher thread does the same old
flushing we always did, just asynchronously. As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks. These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 25d609f8 25-Mar-2016 Josef Bacik <jbacik@fb.com>

Btrfs: fix callers of btrfs_block_rsv_migrate

So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv. This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size. So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 31b9655f 11-Apr-2016 Josef Bacik <jbacik@fb.com>

Btrfs: track transid for delayed ref flushing

Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling. This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore. So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling. If the transaction we
get is different then we can just bail out. This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes). Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 81a75f67 05-Jun-2016 Mike Christie <mchristi@redhat.com>

btrfs: use bio fields for op and flags

The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 01327610 19-May-2016 Nicholas D Steeves <nsteeves@gmail.com>

btrfs: fix string and comment grammatical issues and typos

Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f78c436c 09-May-2016 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between block group relocation and nocow writes

Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

CPU 1 CPU 2

btrfs_relocate_block_group(bg X)

direct IO write starts against
an extent in block group X
using nocow mode (inode has the
nodatacow flag or the write is
for a prealloc extent)

btrfs_direct_IO()
btrfs_get_blocks_direct()
--> can_nocow_extent() returns 1

btrfs_inc_block_group_ro(bg X)
--> turns block group into RO mode

btrfs_wait_ordered_roots()
--> returns and does not know about
the DIO write happening at CPU 2
(the task there has not created
yet an ordered extent)

relocate_block_group(bg X)
--> rc->stage == MOVE_DATA_EXTENTS

find_next_extent()
--> returns extent that the DIO
write is going to write to

relocate_data_extent()

relocate_file_extent_cluster()

--> reads the extent from disk into
pages belonging to the relocation
inode and dirties them

--> creates DIO ordered extent

btrfs_submit_direct()
--> submits bio against a location
on disk obtained from an extent
map before the relocation started

btrfs_wait_ordered_range()
--> writes all the pages read before
to disk (belonging to the
relocation inode)

relocation finishes

bio completes and wrote new data
to the old location of the block
group

So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>


# 9cfa3e34 26-Apr-2016 Filipe Manana <fdmanana@suse.com>

Btrfs: don't do unnecessary delalloc flushes when relocating

Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

CPU 1 CPU 2

btrfs_relocate_block_group(bg X)

starts direct IO write,
target inode currently has no
ordered extents ongoing nor
dirty pages (delalloc regions),
therefore the root for our inode
is not in the list
fs_info->ordered_roots

btrfs_direct_IO()
__blockdev_direct_IO()
btrfs_get_blocks_direct()
btrfs_lock_extent_direct()
locks range in the io tree
btrfs_new_extent_direct()
btrfs_reserve_extent()
--> extent allocated
from bg X

btrfs_inc_block_group_ro(bg X)

btrfs_start_delalloc_roots()
__start_delalloc_inodes()
--> does nothing, no dealloc ranges
in the inode's io tree so the
inode's root is not in the list
fs_info->delalloc_roots

btrfs_wait_ordered_roots()
--> does not find the inode's root in the
list fs_info->ordered_roots

--> ends up not waiting for the direct IO
write started by the task at CPU 2

relocate_block_group(rc->stage ==
MOVE_DATA_EXTENTS)

prepare_to_relocate()
btrfs_commit_transaction()

iterates the extent tree, using its
commit root and moves extents into new
locations

btrfs_add_ordered_extent_dio()
--> now a ordered extent is
created and added to the
list root->ordered_extents
and the root added to the
list fs_info->ordered_roots
--> this is too late and the
task at CPU 1 already
started the relocation

btrfs_commit_transaction()

btrfs_finish_ordered_io()
btrfs_alloc_reserved_file_extent()
--> adds delayed data reference
for the extent allocated
from bg X

relocate_block_group(rc->stage ==
UPDATE_DATA_PTRS)

prepare_to_relocate()
btrfs_commit_transaction()
--> delayed refs are run, so an extent
item for the allocated extent from
bg X is added to extent tree
--> commit roots are switched, so the
next scan in the extent tree will
see the extent item

sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>


# db671160 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs_tree.h migration, item types and defines

The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly
to userspace. In order to decode them, full type information is required.

Create a new header, btrfs_tree to contain these since most users won't
need them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33ca9133 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args

struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 18db9ac6 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs.h migration, move feature flags

The compat/compat_ro/incompat feature flags are used by the feature set/get
ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 83288b60 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs.h migration, qgroup limit flags

The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which
fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d4ae133b 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE

BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c63c245 29-Oct-2015 Luke Dashjr <luke@dashjr.org>

btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl

32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.

Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c5f4ccb2 16-Mar-2016 Anand Jain <anand.jain@oracle.com>

btrfs: move error handling code together in ctree.h

Looks like we added the incompatible defines in between the error
handling defines in the file ctree.h. Now group them back.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2351d743 16-Mar-2016 Anand Jain <anand.jain@oracle.com>

btrfs: remove unused function btrfs_assert()

Apparently looks like ASSERT does the same intended job,
as intended btrfs_assert().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 34d97007 16-Mar-2016 Anand Jain <anand.jain@oracle.com>

btrfs: rename btrfs_std_error to btrfs_handle_fs_error

btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# bb7ab3b9 04-Mar-2016 Adam Buchbinder <adam.buchbinder@gmail.com>

btrfs: Fix misspellings in comments.

Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ebb8765b 10-Mar-2016 Anand Jain <anand.jain@oracle.com>

btrfs: move btrfs_compression_type to compression.h

So that its better organized.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 73beece9 17-Jul-2015 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix lockdep deadlock warning due to dev_replace

Xfstests btrfs/011 complains about a deadlock warning,

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1226.652955] Possible interrupt unsafe locking scenario:

[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***

Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.

The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock

However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to

[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.

To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.

With this, btrfs/011 no more produces warnings in dmesg.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d5131b65 17-Feb-2016 David Sterba <dsterba@suse.com>

btrfs: drop unused argument in btrfs_ioctl_get_supported_features

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c5868f83 17-Feb-2016 David Sterba <dsterba@suse.com>

btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls

The control device is accessible when no filesystem is mounted and we
may want to query features supported by the module. This is already
possible using the sysfs files, this ioctl is for parity and
convenience.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7e98a7f 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: change max_inline default to 2048

The current practical default is ~4k on x86_64 (the logic is more complex,
simplified for brevity), the inlined files land in the metadata group and
thus consume space that could be needed for the real metadata.

The inlining brings some usability surprises:

1) total space consumption measured on various filesystems and btrfs
with DUP metadata was quite visible because of the duplicated data
within metadata

2) inlined data may exhaust the metadata, which are more precious in case
the entire device space is allocated to chunks (ie. balance cannot
make the space more compact)

3) performance suffers a bit as the inlined blocks are duplicate and
stored far away on the device.

Proposed fix: set the default to 2048

This fixes namely 1), the total filesysystem space consumption will be on
par with other filesystems.

Partially fixes 2), more data are pushed to the data block groups.

The characteristics of 3) are based on actual small file size
distribution.

The change is independent of the metadata blockgroup type (though it's
most visible with DUP) or system page size as these parameters are not
trival to find out, compared to file size.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0138b6fe 26-Jan-2016 Byongho Lee <bhlee.kernel@gmail.com>

btrfs: simplify expression in btrfs_calc_trans_metadata_size()

Simplify expression in btrfs_calc_trans_metadata_size().

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2fefd558 07-Jan-2016 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: reada: limit max works count

Reada creates 2 works for each level of tree recursively.

In case of a tree having many levels, the number of created works
is 2^level_of_tree.
Actually we don't need so many works in parallel, this patch limits
max works to BTRFS_MAX_MIRRORS * 2.

The per-fs works_counter will be also used for btrfs_reada_wait() to
check is there are background workers.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 02873e43 31-Dec-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: reada: Use fs_info instead of root in __readahead_hook's argument

What __readahead_hook() need exactly is fs_info, no need to convert
fs_info to root in caller and convert back in __readahead_hook()

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 96da0919 18-Jan-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Introduce new mount option to disable tree log replay

Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8dcddfa0 18-Jan-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Introduce new mount option usebackuproot to replace recovery

Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.

Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.

Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.

This provides the basis for later unified "norecovery" mount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
docs ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 242e2956 25-Jan-2016 David Sterba <dsterba@suse.com>

btrfs: switch dev stats item to the permanent item key

Signed-off-by: David Sterba <dsterba@suse.com>


# 50c2d5ab 25-Jan-2016 David Sterba <dsterba@suse.com>

btrfs: introduce key type for persistent permanent items

The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree.

Similar to the temporary items, we'll introduce a new name for an
existing key value and use the objectid for further extension. The
victim is the BTRFS_DEV_STATS_KEY (248).

The device stats are an example of a permanent item.

Signed-off-by: David Sterba <dsterba@suse.com>


# 0bbbccb1 25-Jan-2016 David Sterba <dsterba@suse.com>

btrfs: introduce key type for persistent temporary items

The number of distinct key types is not that big that we could waste one
for something new we want to store in the tree. We'll introduce a new
name for an existing key value and use the objectid for further
extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248).

The nature of the balance status item is a good example of the temporary
item. It exists from beginning of the balance, keeps the status until it
finishes.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9703fefe 21-Jan-2016 Chandan Rajendra <chandan@linux.vnet.ibm.com>

Btrfs: fallocate: Work with sectorsized blocks

While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2e78c927 21-Jan-2016 Chandan Rajendra <chandan@linux.vnet.ibm.com>

Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size

Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservation/releases in block size units.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0bc19f90 06-Jan-2016 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: merge functions for wait snapshot creation

wait_for_snapshot_creation() is in same group with oher two:
btrfs_start_write_no_snapshoting()
btrfs_end_write_no_snapshoting()

Rename wait_for_snapshot_creation() and move it into same place
with other two.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c2d6cb16 15-Jan-2016 Filipe Manana <fdmanana@suse.com>

Btrfs: fix deadlock running delayed iputs at transaction commit time

While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:

[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires multiple times a
semaphore it should invoke down_read_nested() with a different lockdep
class for each level of recursion.

Fix this by simplifying the implementation and use a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.

Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4fb72bf2 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path locks

The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:

* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:

text data bss dec hex filename
938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after

(and the generated assembly does not change much)

The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.

Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7853f15b 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path lowest_level

The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.

Signed-off-by: David Sterba <dsterba@suse.com>


# dccabfad 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: use smaller type for btrfs_path reada

The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.

Signed-off-by: David Sterba <dsterba@suse.com>


# e4058b54 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: cleanup, use enum values for btrfs_path reada

Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942f488209dded30f6bc648167bcefa72
"Btrfs: do less aggressive btree readahead" (2009-01-22).

Signed-off-by: David Sterba <dsterba@suse.com>


# 4d4ab6d6 19-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: constify static arrays

There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.

Signed-off-by: David Sterba <dsterba@suse.com>


# ee22184b 14-Dec-2015 Byongho Lee <bhlee.kernel@gmail.com>

Btrfs: use linux/sizes.h to represent constants

We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9780c497 18-Oct-2015 Alexandru Moise <00moses.alexander00@gmail.com>

btrfs: switch __btrfs_fs_incompat return type from int to bool

Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2b3909f8 19-Dec-2015 Darrick J. Wong <darrick.wong@oracle.com>

btrfs: use new dedupe data function pointer

Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 511711af 30-Dec-2015 Chris Mason <clm@fb.com>

btrfs: don't run delayed references while we are creating the free space tree

This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.

Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.

Signed-off-by: Chris Mason <clm@fb.com>


# 70f6d82e 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: add free space tree mount option

Now we can finally hook up everything so we can actually use free space
tree. The free space tree is enabled by passing the space_cache=v2 mount
option. On the first mount with the this option set, the free space tree
will be created and the FREE_SPACE_TREE read-only compat bit will be
set. Any time the filesystem is mounted from then on, we must use the
free space tree. The clear_cache option will also clear the free space
tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a5ed9182 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: implement the free space B-tree

The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for a lot of block groups needs
to be written out, we can get extremely long commit times; if this
happens in the critical section, things are especially bad because we
block new transactions from happening.

The main problem with the free space cache is that it has to be written
out in its entirety and is managed in an ad hoc fashion. Using a B-tree
to store free space fixes this: updates can be done as needed and we get
all of the benefits of using a B-tree: checksumming, RAID handling,
well-understood behavior.

With the free space tree, we get commit times that are about the same as
the no cache case with load times slower than the free space cache case
but still much faster than the no cache case. Free space is represented
with extents until it becomes more space-efficient to use bitmaps,
giving us similar space overhead to the free space cache.

The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree and clear the free space tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 208acb8c 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: introduce the free space B-tree on-disk format

The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, and bitmaps instead have the
FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
array of bytes.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 73fa48b6 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: refactor caching_thread()

We're also going to load the free space tree from caching_thread(), so
we should refactor some of the common code.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1abfbcdf 29-Sep-2015 Omar Sandoval <osandov@fb.com>

Btrfs: add helpers for read-only compat bits

We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.
While we're add it, also add helpers to clear the bits.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 04b38d60 02-Dec-2015 Christoph Hellwig <hch@lst.de>

vfs: pull btrfs clone API to vfs layer

The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development. To avoid growth of various slightly incompatible
implementations, add one to the VFS. Note that clones are different from
file copies in several ways:

- they are atomic vs other writers
- they support whole file clones
- they support 64-bit legth clones
- they do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure. The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 30424601 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: remove wait from struct btrfs_delalloc_work

The value is 0 and never changes, we can propagate the value, remove
wait and the implied dead code.

Signed-off-by: David Sterba <dsterba@suse.com>


# 651d494a 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: sink parameter wait to btrfs_alloc_delalloc_work

There's only one caller and single value, we can propagate it down to
the callee and remove the unused parameter.

Signed-off-by: David Sterba <dsterba@suse.com>


# 3db11b2e 10-Nov-2015 Zach Brown <zab@redhat.com>

btrfs: add .copy_file_range file operation

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 758f2dfc 19-Nov-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix scrub preventing unused block groups from being deleted

Currently scrub can race with the cleaner kthread when the later attempts
to delete an unused block group, and the result is preventing the cleaner
kthread from ever deleting later the block group - unless the block group
becomes used and unused again. The following diagram illustrates that
race:

CPU 1 CPU 2

cleaner kthread
btrfs_delete_unused_bgs()

gets block group X from
fs_info->unused_bgs and
removes it from that list

scrub_enumerate_chunks()

searches device tree using
its commit root

finds device extent for
block group X

gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())

sets bg X to RO

sees the block group is
already RO and therefore
doesn't delete it nor adds
it back to unused list

So fix this by making scrub add the block group again to the list of
unused block groups if the block group is still unused when it finished
scrubbing it and it hasn't been removed already.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7fd01182 13-Nov-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix the number of transaction units needed to remove a block group

We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.

While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible heigth (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.

However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a higher heigth and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.

In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:

http://www.spinics.net/lists/linux-btrfs/msg46070.html

So fix this by reserving all the necessary units of metadata.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8eab77ff 13-Nov-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: use global reserve when deleting unused block group after ENOSPC

It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c62d2555 06-Nov-2015 Michal Hocko <mhocko@suse.com>

mm, fs: introduce mapping_gfp_constraint()

There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5846a3c2 26-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans

Between btrfs_allocerved_file_extent() and
btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs
are run and delayed ref head maybe freed before
btrfs_add_delayed_qgroup_reserve().

This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT,
and cause transaction to be aborted.

This patch will record qgroup reserve space info into delayed_ref_head
at btrfs_add_delayed_ref(), to eliminate the race window.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# bc309467 20-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: extend balance filter usage to take minimum and maximum

Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# dee32d0a 28-Sep-2015 Gabríel Arthúr Pétursson <gabriel@system.is>

btrfs: add balance filter for stripes

Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.

Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>


# 12907fc7 10-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: extend balance filter limit to take minimum and maximum

The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.

Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# b06c4bf5 23-Oct-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix regression running delayed references when using qgroups

In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:

commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}

This happens on a scenario like the following:

1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.

4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.

5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.

7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.

8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.

9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).

So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").

The use of no_quota was also buggy in at least two places:

1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled

2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.

This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.

Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).

Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:

INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
(...)
Call Trace:
[<ffffffff86999201>] schedule+0x74/0x83
[<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
[<ffffffff86137ed7>] ? wait_woken+0x74/0x74
[<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
[<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
[<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
[<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
[<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
[<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
[<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
[<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
[<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
[<ffffffff863e852e>] check_ref+0x64/0xc4
[<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
[<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
[<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
[<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
(...)

The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).

Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>


# c759c4e1 02-Oct-2015 Josef Bacik <jbacik@fb.com>

Btrfs: don't keep trying to build clusters if we are fragmented

If we are extremely fragmented then we won't be able to create a free_cluster.
So if this happens set last_ptr->fragmented so that all future allcations will
give up trying to create a cluster. When we unpin extents we will unset
->fragmented if we free up a sufficient amount of space in a block group.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4f4db217 29-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: keep track of max_extent_size per space_info

When we are heavily fragmented we can induce a lot of latency trying to make an
allocation happen that is simply not going to happen. Thankfully we keep track
of our max_extent_size when going through the allocator, so if we get to the
point where we are exiting find_free_extent with ENOSPC then set our
space_info->max_extent_size so we can keep future allocations from having to pay
this cost. We reset the max_extent_size whenever we release pinned bytes back
into this space info so we can redo all the work. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d0bd4560 23-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: add fragment=* debug mount option

In tracking down these weird bitmap problems it was helpful to artificially
create an extremely fragmented file system. These mount options let us either
fragment data or metadata or both. With these options I could reproduce all
sorts of weird latencies and hangs that occur under extreme fragmentation and
get them fixed. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 51773bec 08-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook

In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.

So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock io_tree, so it can't be called
with io_tree lock hold.

This patch will add a new function
btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease
the lockdep warning.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7cf5b976 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Cleanup old inaccurate facilities

Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.

Now, the whole btrfs qgroup is swithed to use the new reserve facilities.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1ada3a62 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space

Add new version of btrfs_delalloc_reserve_space() and
btrfs_delalloc_release_space() functions, which supports accurate qgroup
reserve.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4ceff079 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: extent-tree: Add new version of btrfs_check_data_free_space and btrfs_free_reserved_data_space.

Add new functions __btrfs_check_data_free_space() and
__btrfs_free_reserved_data_space() to work with new accurate qgroup
reserved space framework.

The new function will replace old btrfs_check_data_free_space() and
btrfs_free_reserved_data_space() respectively, but until all the change
is done, let's just use the new name.

Also, export internal use function btrfs_alloc_data_chunk_ondemand(), as
now qgroup reserve requires precious bytes, some operation can't get the
accurate number in advance(like fallocate).
But data space info check and data chunk allocate doesn't need to be
that accurate, and can be called at the beginning.

So export it for later operations.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 55eeaf05 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce new functions to reserve/free metadata

Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free
metadata reserved space.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1dd6d7ca 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: introduce ratelimited variants of message printing functions

Signed-off-by: David Sterba <dsterba@suse.com>


# 24aa6b41 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: introduce ratelimited _in_rcu variants of message printing functions

Signed-off-by: David Sterba <dsterba@suse.com>


# 08a84e25 08-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: introduce _in_rcu variants of message printing functions

Due to the missing variants there are messages that lack the information
printed by btrfs_info etc helpers.

Signed-off-by: David Sterba <dsterba@suse.com>


# a4553fef 25-Sep-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: consolidate btrfs_error() to btrfs_std_error()

btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6618a59b 14-Aug-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 96f3136e 14-Aug-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a4027a20 09-Jul-2015 Byongho Lee <bhlee.kernel@gmail.com>

Btrfs: remove unused mutex from struct 'btrfs_fs_info'

The code using 'ordered_extent_flush_mutex' mutex has removed by below
commit.
- 8d875f95da43c6a8f18f77869f2ef26e9594fecc
btrfs: disable strict file flushes for renames and truncates
But the mutex still lives in struct 'btrfs_fs_info'.

So, this patch removes the mutex from struct 'btrfs_fs_info' and its
initialization code.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 147d256e 06-Aug-2015 Zhaolei <zhaolei@cn.fujitsu.com>

btrfs: Remove unnecessary variants in relocation.c

These arguments are not used in functions, remove them for cleanup
and make kernel stack happy.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 868f401a 05-Aug-2015 Zhaolei <zhaolei@cn.fujitsu.com>

btrfs: Use ref_cnt for set_block_group_ro()

More than one code call set_block_group_ro() and restore rw in fail.

Old code use bool bit to save blockgroup's ro state, it can not
support parallel case(it is confirmd exist in my debug log).

This patch use ref count to store ro state, and rename
set_block_group_ro/set_block_group_rw
to
inc_block_group_ro/dec_block_group_ro.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e33e17ee 15-Jun-2015 Jeff Mahoney <jeffm@suse.com>

btrfs: add missing discards when unpinning extents with -o discard

When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit. As a result, any extents that
would be freed when the block group is removed aren't discarded. In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 67c5e7d4 10-Jun-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between balance and unused block group deletion

We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:

[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)

The sequence of steps leading to this are:

CPU 0 CPU 1

btrfs_balance()
btrfs_relocate_chunk()

btrfs_relocate_block_group(bg X)
btrfs_lookup_block_group(bg X)

cleaner_kthread
locks fs_info->cleaner_mutex

btrfs_delete_unused_bgs()
finds bg X, which became
unused in the previous
transaction

checks bg X ->ro == 0,
so it proceeds
sets bg X ->ro to 1
(btrfs_set_block_group_ro(bg X))

blocks on fs_info->cleaner_mutex
btrfs_remove_chunk(bg X)
unlocks fs_info->cleaner_mutex

acquires fs_info->cleaner_mutex
relocate_block_group()
--> does nothing, no extents found in
the extent tree from bg X
unlocks fs_info->cleaner_mutex

btrfs_relocate_block_group(bg X) returns

btrfs_remove_chunk(bg X)
extent map not found
--> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.

This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:

while true; do btrfs balance start -dusage=0 <mountpoint> ; done

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e69bcee3 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.

Goodbye, the old mechanisim.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 20b2e302 04-Jun-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()

lockdep report following warning in test:
[25176.843958] =================================
[25176.844519] [ INFO: inconsistent lock state ]
[25176.845047] 4.1.0-rc3 #22 Tainted: G W
[25176.845591] ---------------------------------
[25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
[25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
[25176.847838] {SOFTIRQ-ON-W} state was registered at:
[25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
[25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
[25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
[25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
[25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
[25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
[25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
[25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
[25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
[25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
[25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
[25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
[25176.854935] irq event stamp: 51506
[25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
[25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
[25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
[25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
[25176.857746]
other info that might help us debug this:
[25176.858845] Possible unsafe locking scenario:
[25176.859981] CPU0
[25176.860537] ----
[25176.861059] lock(&wr_ctx->wr_lock);
[25176.861705] <Interrupt>
[25176.862272] lock(&wr_ctx->wr_lock);
[25176.862881]
*** DEADLOCK ***

Reason:
Above warning is caused by:
Interrupt
-> bio_endio()
-> ...
-> scrub_put_ctx()
-> scrub_free_ctx() *1
-> ...
-> mutex_lock(&wr_ctx->wr_lock);

scrub_put_ctx() is allowed to be called in end_bio interrupt, but
in code design, it will never call scrub_free_ctx(sctx) in interrupe
context(above *1), because btrfs_scrub_dev() get one additional
reference of sctx->refs, which makes scrub_free_ctx() only called
withine btrfs_scrub_dev().

Now the code runs out of our wish, because free sequence in
scrub_pending_bio_dec() have a gap.

Current code:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
-> scrub_free_ctx() |
-----------------------------------+-----------------------------------

We expected:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
| -> scrub_free_ctx()
-----------------------------------+-----------------------------------

Fix:
Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
in interrupt context.
Tested by check tracelog in debug.

Changelog v1->v2:
Use workqueue instead of adjust function call sequence in v1,
because v1 will introduce a bug pointed out by:
Filipe David Manana <fdmanana@gmail.com>

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4617ea3a 09-Jun-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix necessary chunk tree space calculation when allocating a chunk

When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:

btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
btrfs_calc_trans_metadata_size(chunk_root, 1)

That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, and each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary since updating the existing device items does
not result in splitting the nodes and leaf since after updating them
they remain with the same size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 39c2d7fa 20-May-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix -ENOSPC on block group removal

Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.

While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:

[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left

So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error our if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4fbcdf66 20-May-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix -ENOSPC when finishing block group creation

While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:

[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.

So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.

Fix this by doing proper reservation in the chunk block reserve.

The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1f6e4b3f 04-May-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Fix superblock csum type check.

Old csum type check is wrong and can't catch csum_type 1(not supported).

Fix it to avoid hostile 0 division.

Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# c0d19e2b 24-Apr-2015 David Sterba <dsterba@suse.cz>

btrfs: add 'cold' compiler annotations to all error handling functions

The annotated functios will be placed into .text.unlikely section. The
annotation also hints compiler to move the code out of the hot paths,
and may implicitly mark if-statement leading to that block as unlikely.

This is a heuristic, the impact on the generated code is not
significant.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 1a9a8a71 24-Apr-2015 David Sterba <dsterba@suse.cz>

btrfs: report exact callsite where transaction abort occurs

WARN is called from a single location and all bugreports say that's in
super.c __btrfs_abort_transaction. This is slightly confusing as we'd
rather want to know the exact callsite. Whereas this information is
printed in the syslog below the stacktrace, this requires further look
and we usually see only the headline from WARNING.

Moving the WARN into the macro has to inline some code and increases
code by a few kilobytes:

text data bss dec hex filename
835481 20305 14120 869906 d4612 btrfs.ko.before
842883 20305 14120 877308 d62fc btrfs.ko.after

The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
The increase is not small and could lead to worse icache use. The code
is on error/exit paths that can be recognized by compiler as cold and
moved out of the way so the impact is speculated to be low, if
measurable at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 2e7910d6 09-Mar-2015 Anand Jain <Anand.Jain@oracle.com>

Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices

This patch will provide a framework and help to create attributes
from the structure btrfs_fs_devices which are available even before
fs_info is created. So by moving the parent kobject super_kobj from
fs_info to btrfs_fs_devices, it will help to create attributes
from the btrfs_fs_devices as well.

Patches on top of this patch now will be able to create the
sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
when devices are scanned and registered to the kernel.

Just to note, this does not change any of the existing btrfs sysfs
external kobject names and its attributes and not even the life
cycle of them. Changes are internal only. And to ensure the same,
this path has been tested with various device operations and,
checking and comparing the sysfs kobjects and attributes with
sysfs kobject and attributes with out this patch, and they remain
same.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# e09fe2d2 27-Feb-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created

Btrfs will create qgroup on subvolume creation if quota is enabled, but
qgroup uses the high bits(currently 16 bits) as level, to build the
inheritance.

However it is fully possible a subvolume can be created with a
subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be
considered as level 1 and can't be assigned to other qgroup in level 1.

This patch will prevent such things so qgroup inheritance will not be
screwed up.
The downside is very clear, btrfs subvolume number limit will decrease
from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to
(u48 max -256(first free objectid)).
But we still have near u48(that's 15 digits in dec), so that should not
be a huge problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8465ecec 27-Feb-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Check qgroup level in kernel qgroup assign.

Although we have qgroup level check in btrfs-progs, it's not enough
since other programe may still call ioctl directly not using
btrfs-progs. For example, systemd.

But it's btrfs-progs to be blame since we don't provide a
full-function(like subvolume create things) btrfs library with enough
check, and only rely on kernel ioctl.

So Add level checks in kernel too.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e2d1f923 06-Feb-2015 Dongsheng Yang <yangds.fnst@cn.fujitsu.com>

btrfs: qgroup: do a reservation in a higher level.

There are two problems in qgroup:

a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.

b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.

The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.

In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d7c15171 25-Feb-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: Fix NO_SPACE bug caused by delayed-iput

Steps to reproduce:
while true; do
dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
rm /btrfs_dir/file
sync
done

And we'll see dd failed because btrfs return NO_SPACE.

Reason:
Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
in end to free fs space for next write, but sometimes it hadn't
done work on time, because btrfs-cleaner thread get delayed-iputs
from list before, but do iput() after next write.

This is log:
[ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin

[ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
[ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
[ 2569.087554] comm=sync func=btrfs_commit_transaction() end

[ 2569.191081] comm=dd begin
[ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28

[ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
[ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
...
[ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
[ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end

Fix:
Make btrfs_commit_transaction() wait current running btrfs-cleaner's
delayed-iputs() done in end.

Test:
Use script similar to above(more complex),
before patch:
7 failed in 100 * 20 loop.
after patch:
0 failed in 100 * 20 loop.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# cdfb080e 06-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: fix use after free when close_ctree frees the orphan_rsv

Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv. The problem is this call updates the
space_info, which has already been freed.

This adds a new __ function that directly calls kfree instead of trying
to update the space infos.

Signed-off-by: Chris Mason <clm@fb.com>


# 1bbc621e 06-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: allow block group cache writeout outside critical section in commit

We loop through all of the dirty block groups during commit and write
the free space cache. In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit. We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>


# c9dc4c65 04-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: two stage dirty block group writeout

Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one. This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason <clm@fb.com>


# 4c6d1d85 06-Apr-2015 Chris Mason <clm@fb.com>

btrfs: move struct io_ctl into ctree.h and rename it

We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h

Signed-off-by: Chris Mason <clm@fb.com>


# 28f75a0e 04-Feb-2015 Chris Mason <clm@fb.com>

Btrfs: refill block reserves during truncate

When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents. This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason <clm@fb.com>


# 6a3891c5 16-Mar-2015 Josef Bacik <jbacik@fb.com>

Btrfs: add sanity test for outstanding_extents accounting

I introduced a regression wrt outstanding_extents accounting. These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time. So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>


# dcdf7f6d 02-Mar-2015 Josef Bacik <jbacik@fb.com>

Btrfs: prepare block group cache before writing

Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff. To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>


# 3284da7b 25-Feb-2015 David Sterba <dsterba@suse.cz>

btrfs: use explicit initializer for seq_elem

Using {} as initializer for struct seq_elem does not properly initialize
the list_head member, but it currently works because it gets set through
btrfs_get_tree_mod_seq if 'seq' is 0.

Signed-off-by: David Sterba <dsterba@suse.cz>


# b7a0365e 11-Nov-2014 Daniel Dressler <danieru.dressler@gmail.com>

Btrfs: ctree: reduce args where only fs_info used

This patch is part of a larger project to cleanup btrfs's internal usage
of struct btrfs_root. Many functions take btrfs_root only to grab a
pointer to fs_info.

This causes programmers to ponder which root can be passed. Since only
the fs_info is read affected functions can accept any root, except this
is only obvious upon inspection.

This patch reduces the specificty of such functions to accept the
fs_info directly.

This patch does not address the two functions in ctree.c (insert_ptr,
and split_item) which only use root for BUG_ONs in ctree.c

This patch affects the following functions:
1) fixup_low_keys
2) btrfs_set_item_key_safe

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# dcab6a3b 11-Feb-2015 Josef Bacik <jbacik@fb.com>

Btrfs: account for large extents with enospc

On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d4b450cd 29-Jan-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between transaction commit and empty block group removal

Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

CPU 1 CPU 2

btrfs_commit_transaction()
fs_info->running_transaction = NULL
btrfs_finish_extent_commit()
find_first_extent_bit()
-> found range for block group X
in fs_info->freed_extents[]

btrfs_delete_unused_bgs()
-> found block group X

Removed block group X's range
from fs_info->freed_extents[]

btrfs_remove_chunk()
btrfs_remove_block_group(bg X)

unpin_extent_range(bg X range)
btrfs_lookup_block_group(bg X)
-> returns NULL
-> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpining the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs()

This race got introduced with the change:

Btrfs: remove empty block groups automatically
commit 47ab2a6c689913db23ccae38349714edf8365e0a

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a937b979 12-Dec-2014 David Sterba <dsterba@suse.cz>

btrfs: kill btrfs_inode_*time helpers

They just opencode taking address of the timespec member.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 78f55e5e 12-Jan-2015 Anand Jain <Anand.Jain@oracle.com>

Btrfs: fix unused members in struct btrfs_root

There isn't any real use of following members of struct btrfs_root
so delete them.

struct kobject root_kobj;
struct completion kobj_unregister;

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# ffe2d203 20-Jan-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply

So we can check raid56 with:
(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
(map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ce93ec54 17-Nov-2014 Josef Bacik <jbacik@fb.com>

Btrfs: track dirty block groups on their own list

Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set. This function
can get called several times during a commit. So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time. Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups. This allows us to clean up how we write the
free space cache out so it is much cleaner. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e7070be1 16-Dec-2014 Josef Bacik <jbacik@fb.com>

Btrfs: change how we track dirty roots

I've been overloading root->dirty_list to keep track of dirty roots and which
roots need to have their commit roots switched at transaction commit time. This
could cause us to lose an update to the root which could corrupt the file
system. To fix this use a state bit to know if the root is dirty, and if it
isn't set we go ahead and move the root to the dirty list. This way if we
re-dirty the root after adding it to the switch_commit list we make sure to
update it. This also makes it so that the extent root is always the last root
on the dirty list to try and keep the amount of churn down at this point in the
commit. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 75c68e9f 16-Jan-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race deleting block group from space_info->ro_bgs list

When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.

This issue was introduced in the 3.19 kernel by the following change:

Btrfs: move read only block groups onto their own list V2
commit 633c0aad4c0243a506a3e8590551085ad78af82d

I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1edb647b 08-Dec-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: remove non-sense btrfs_error_discard_extent() function

It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 04216820 27-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between fs trimming and block group remove/allocation

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4f69cb98 26-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix crash caused by block group removal

If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>] [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8 EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS: 0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238] ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238] ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238] ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238] [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238] [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238] [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238] [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238] [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4245215d 25-Nov-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 9ea24bbe 29-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix snapshot inconsistency after a file write followed by truncate

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ xfs_io -f \
-c "pwrite -S 0xaa -b 32K 0 32K" \
-c "fsync" \
-c "pwrite -S 0xbb -b 32770 16K 32770" \
-c "truncate 90123" \
/mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

"root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# b38ef71c 13-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: ensure ordered extent errors aren't missed on fsync

When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;

4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5f5bc6b1 09-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: make xattr replace operations atomic

Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.

This change also fixes 2 other existing issues which were:

*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);

*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.

A test case for xfstests follows soon.

Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.

Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 633c0aad 31-Oct-2014 Josef Bacik <jbacik@fb.com>

Btrfs: move read only block groups onto their own list V2

Our gluster boxes were spending lots of time in statfs because our fs'es are
huge. The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups. Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 728404da 10-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: add helper btrfs_fdatawrite_range

To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d51033d0 12-Nov-2014 David Sterba <dsterba@suse.cz>

btrfs: introduce pending action: commit

In some contexts, like in sysfs handlers, we don't want to trigger a
transaction commit. It's a heavy operation, we don't know what external
locks may be taken. Instead, make it possible to finish the operation
through sync syscall or SYNC_FS ioctl.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 7e1876ac 05-Feb-2014 David Sterba <dsterba@suse.cz>

btrfs: switch inode_cache option handling to pending changes

The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.

Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 572d9ab7 05-Feb-2014 David Sterba <dsterba@suse.cz>

btrfs: add support for processing pending changes

There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.

For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).

Signed-off-by: David Sterba <dsterba@suse.cz>


# 1a4ed8fd 27-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()

If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 0d4cf4e6 07-Oct-2014 Chris Mason <clm@fb.com>

Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off

Commit fccb84c94 moved added some helpers to cleanup our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.

This fixes things to work when they are turned off too.

Signed-off-by: Chris Mason <clm@fb.com>


# f667aef6 22-Sep-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Make btrfs handle security mount options internally to avoid losing security label.

[BUG]
Originally when mount btrfs with "-o subvol=" mount option, btrfs will
lose all security lable.
And if the btrfs fs is mounted somewhere else, due to the lost of
security lable, SELinux will refuse to mount since the same super block
is being mounted using different security lable.

[REPRODUCER]
With SELinux enabled:
#mkfs -t btrfs /dev/sda5
#mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
#btrfs subvolume create /mnt/btrfs/subvol
#mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
/mnt/test

kernel message:
SELinux: mount invalid. Same superblock, different security settings
for (dev sda5, type btrfs)

[REASON]
This happens because btrfs will call vfs_kern_mount() and then
mount_subtree() to handle subvolume name lookup.
First mount will cut off all the security lables and when it comes to
the second vfs_kern_mount(), it has no security label now.

[FIX]
This patch will makes btrfs behavior much more like nfs,
which has the type flag FS_BINARY_MOUNTDATA,
making btrfs handles the security label internally.
So security label will be set in the real mount time and won't lose
label when use with "subvol=" mount option.

Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 15b636e1 25-Sep-2014 Fabian Frederick <fabf@skynet.be>

Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.

Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
and keep prototype in qgroup.h only

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>


# fccb84c9 29-Sep-2014 David Sterba <dsterba@suse.cz>

btrfs: move checks for DUMMY_ROOT into a helper

Signed-off-by: David Sterba <dsterba@suse.cz>


# 7ec20afb 24-Jul-2014 David Sterba <dsterba@suse.cz>

btrfs: new define for the inline extent data start

Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like "Btrfs: fix wrong max inline data
size limit" fixed.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 95ac567a 08-Aug-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: set default max_inline to 8KiB instead of 8MiB

8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>


# 4d75f8a9 14-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: remove blocksize from btrfs_alloc_free_block and rename

Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 47ab2a6c 18-Sep-2014 Josef Bacik <jbacik@fb.com>

Btrfs: remove empty block groups automatically

One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space. This happens because all the chunks were allocated for
data since the metadata requirements were so low. But now there's a bunch of
empty data block groups and not enough metadata space to do anything. This
patch solves this problem by automatically deleting empty block groups. If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it. This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8b110e39 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: implement repair function when direct read fails

This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 23ea8e5a 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: load checksum data once when submitting a direct read io

The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f87c4318 20-Aug-2014 David Sterba <dsterba@suse.cz>

btrfs: remove stale define after removing ordered operations

Last user removed in commit "btrfs: disable strict file flushes for
renames and truncates" (8d875f95da43c6a8f18f77869f2ef26e9594fecc).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# c01a5c07 16-Jul-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: fix wrong max inline data size limit

inline data is stored from offset of @disk_bytenr in
struct btrfs_file_extent_item. So substracting total
size of struct btrfs_file_extent_item is wrong, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 707e8a07 04-Jun-2014 David Sterba <dsterba@suse.cz>

btrfs: use nodesize everywhere, kill leafsize

The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

text data bss dec hex filename
852418 24560 23112 900090 dbbfa btrfs.ko.before
851074 24584 23112 898770 db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 57cdc8db 04-Feb-2014 David Sterba <dsterba@suse.cz>

btrfs: cleanup ino cache members of btrfs_root

The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# e339a6b0 02-Jul-2014 Josef Bacik <jbacik@fb.com>

Btrfs: __btrfs_mod_ref should always use no_quota

Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there. With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e570fd27 18-Jun-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix broken free space cache after the system crashed

When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7ffbb598 08-Jun-2014 Filipe Manana <fdmanana@gmail.com>

Btrfs: make fsync work after cloning into a file

When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c1895442 26-May-2014 Jeff Mahoney <jeffm@suse.com>

btrfs: allocate raid type kobjects dynamically

We are currently allocating space_info objects in an array when we
allocate space_info. When a user does something like:

# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
# btrfs balance start -mconvert=single -dconvert=single /mnt -f
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /

We can end up with memory corruption since the kobject hasn't
been reinitialized properly and the name pointer was left set.

The rationale behind allocating them statically was to avoid
creating a separate kobject container that just contained the
raid type. It used the index in the array to determine the index.

Ultimately, though, this wastes more memory than it saves in all
but the most complex scenarios and introduces kobject lifetime
questions.

This patch allocates the kobjects dynamically instead. Note that
we also remove the kobject_get/put of the parent kobject since
kobject_add and kobject_del do that internally.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# a79b7d4b 22-May-2014 Chris Mason <clm@fb.com>

Btrfs: async delayed refs

Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.

This farms them off to async helper threads. The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.

Signed-off-by: Chris Mason <clm@fb.com>


# faa2dbf0 07-May-2014 Josef Bacik <jbacik@fb.com>

Btrfs: add sanity tests for new qgroup accounting code

This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# fcebe456 13-May-2014 Josef Bacik <jbacik@fb.com>

Btrfs: rework qgroup accounting

Currently qgroups account for space by intercepting delayed ref updates to fs
trees. It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly. The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together. Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc. This patch
accomplishes this by only adding qgroup operations for real ref changes. We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time. This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq: we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence. This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point. This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 27cdeb70 02-Apr-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 21c7e756 13-May-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: reclaim the reserved metadata space at background

Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.

So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.

Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)

Test command:
# compilebench -D <mnt> -m

Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 521e0546 15-Apr-2014 David Sterba <dsterba@suse.cz>

btrfs: protect snapshots from deleting during send

The patch "Btrfs: fix protection between send and root deletion"
(18f687d538449373c37c) does not actually prevent to delete the snapshot
and just takes care during background cleaning, but this seems rather
user unfriendly, this patch implements the idea presented in

http://www.spinics.net/lists/linux-btrfs/msg30813.html

- add an internal root_item flag to denote a dead root
- check if the send_in_progress is set and refuse to delete, otherwise
set the flag and proceed
- check the flag in send similar to the btrfs_root_readonly checks, for
all involved roots

The root lookup in send via btrfs_read_fs_root_no_name will check if the
root is really dead or not. If it is, ENOENT, aborted send. If it's
alive, it's protected by send_in_progress, send can continue.

CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 7d824b6f 07-May-2014 David Sterba <dsterba@suse.cz>

btrfs: balance filter: add limit of processed chunks

This started as debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not finegrained enough and led to
converting too many chunks at once. Another example use is in connection
with drange+devid or vrange filters that allow to work with a specific
chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 9d89ce65 23-Apr-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: move btrfs_{set,clear}_and_info() to ctree.h

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 36523e95 07-Feb-2014 David Sterba <dsterba@suse.cz>

btrfs: export global block reserve size as space_info

Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.

The unpatched userspace tools will show the blockgroup as 'unknown'.

CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# 3f8a18cc 28-Mar-2014 Josef Bacik <jbacik@fb.com>

Btrfs: hold the commit_root_sem when getting the commit root during send

We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 9e351cc8 13-Mar-2014 Josef Bacik <jbacik@fb.com>

Btrfs: remove transaction from send

Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 573bfb72 05-Mar-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix possible empty list access when flushing the delalloc inodes

We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 31f3d255 05-Mar-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: split the global ordered extents mutex

When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.

This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 6c255e67 05-Mar-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock

We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 8257b2dc 05-Mar-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume

If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.

So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# d458b054 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Cleanup the "_struct" suffix in btrfs_workequeue

Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# a046e9c8 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Cleanup the old btrfs_worker.

Since all the btrfs_worker is replaced with the newly created
btrfs_workqueue, the old codes can be easily remove.

Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 0339ef2f 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.

Replace the fs_info->scrub_* with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# fc97fab0 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.

Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 5b3bc44e 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue.

Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# dc6e3209 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.

Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 736cfa15 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue.

Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# e66f0bb1 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue.

Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# d05a33ac 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue.

Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# fccb5d86 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.

Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# a44903ab 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->flush_workers with btrfs_workqueue.

Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# a8c93d4e 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->submit_workers with btrfs_workqueue.

Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# afe3d242 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue

Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 5cdc7ad3 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->workers with btrfs_workqueue.

Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# d1433deb 20-Feb-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: just wait or commit our own log sub-transaction

We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.

This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 8b050d35 20-Feb-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix skipped error handle when log sync failed

It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.

This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# bb14a59b 20-Feb-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use signed integer instead of unsigned long integer for log transid

The log trans id is initialized to be 0 every time we create a log tree,
and the log tree need be re-created after a new transaction is started,
it means the log trans id is unlikely to be a huge number, so we can use
signed integer instead of unsigned long integer to save a bit space.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# c404e0dc 30-Jan-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix use-after-free in the finishing procedure of the device replace

During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.

For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.

For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.

Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 514ac8ad 03-Jan-2014 Chris Mason <clm@fb.com>

Btrfs: don't use ram_bytes for uncompressed inline items

If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size. The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.

Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org


# 26b47ff6 15-Jan-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss

It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 3818aea2 12-Jan-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Add noinode_cache mount option

Add noinode_cache mount option for btrfs.

Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.

To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.

Also noinode_cache mount option will not stop the caching_kthread.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ade2e0b3 12-Jan-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: fix to search previous metadata extent item since skinny metadata

There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.

To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 0a2b2a84 23-Jan-2014 Josef Bacik <jbacik@fb.com>

Btrfs: throttle delayed refs better

On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 63541927 07-Jan-2014 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: add support for inode properties

This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."

Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).

This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.

The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.

Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.

Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.

Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.

* Without properties (test 1)

file creation time ls -lha time
10 000 files 3.49 0.76
100 000 files 47.19 8.37
1 000 000 files 518.51 107.06

* With 1 property (compression property set to lzo - test 2)

file creation time ls -lha time
10 000 files 3.63 0.93
100 000 files 48.56 9.74
1 000 000 files 537.72 125.11

* With 4 properties (test 3)

file creation time ls -lha time
10 000 files 3.94 1.20
100 000 files 52.14 11.48
1 000 000 files 572.70 142.13

* With 10 properties (test 4)

file creation time ls -lha time
10 000 files 4.61 1.35
100 000 files 58.86 13.83
1 000 000 files 656.01 177.61

The increased latencies with properties are essencialy because of:

*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir intex items (delayed-inode.c), which could help
reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
#!/usr/bin/perl -w

use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');

system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);

$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1acae57b 07-Jan-2014 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: faster file extent item replace operations

When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

sysbench --test=fileio --file-num=4096 --file-total-size=8G \
--file-test-mode=[rndwr|seqwr] --num-threads=512 \
--file-block-size=8192 \ --max-requests=1000000 \
--file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change: 277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change: 56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change: 69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change: 14.39 Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# efe120a0 20-Dec-2013 Frank Holton <fholton@gmail.com>

Btrfs: convert printk to btrfs_ and fix BTRFS prefix

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2c686537 16-Dec-2013 David Sterba <dsterba@suse.cz>

btrfs: Check read-only status of roots during send

All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.

Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.

We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.

Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e223cfcd 13-Dec-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct

It's not used anywhere, so just drop it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f28491e0 16-Dec-2013 Josef Bacik <jbacik@fb.com>

Btrfs: move the extent buffer radix tree into the fs_info

I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 27a0dd61 12-Nov-2013 Frank Holton <fholton@gmail.com>

Btrfs: make btrfs_debug match pr_debug handling related to DEBUG

The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 33b98f22 13-Nov-2013 Sergei Trofimovich <slyfox@gentoo.org>

btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'

Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e33d5c3d 04-Nov-2013 Kelley Nielsen <kelleynnn@gmail.com>

btrfs: bootstrap generic btrfs_find_item interface

There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.

This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 29e5be24 01-Nov-2013 Jeff Mahoney <jeffm@suse.com>

btrfs: publish device membership in sysfs

Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 6ab0a202 01-Nov-2013 Jeff Mahoney <jeffm@suse.com>

btrfs: publish allocation data in sysfs

While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5ac1d209 01-Nov-2013 Jeff Mahoney <jeffm@suse.com>

btrfs: publish per-super attributes in sysfs

This patch adds per-super attributes to sysfs.

It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2eaa055f 15-Nov-2013 Jeff Mahoney <jeffm@suse.com>

btrfs: add ioctls to query/change feature bits online

There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.

We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.

We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.

The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e20d6c5b 13-Nov-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix check-integrity to look at the referenced data properly

We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression. Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced. This fixes check_int_data freaking out when running
btrfs/004. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 16e7549f 21-Oct-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: incompatible format change to remove hole extents

Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 996a710d 20-Dec-2013 Christoph Hellwig <hch@infradead.org>

btrfs: use generic posix ACL infrastructure

Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9650e05c 12-Nov-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: remove dead codes from ctree.h

These two functions are only stated but undefined.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 54563d41 01-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

btrfs: get rid of fdentry()

3 of 4 callers actually want file_inode()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 91aef86f 04-Nov-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: rename btrfs_start_all_delalloc_inodes

rename the function -- btrfs_start_all_delalloc_inodes(), and make its
name be compatible to btrfs_wait_ordered_roots(), since they are always
used at the same place.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9b011adf 25-Oct-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: remove scrub_super_lock holding in btrfs_sync_log()

Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# aaedb55b 11-Oct-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add tests for btrfs_get_extent

I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime. This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 294e30fe 08-Oct-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add tests for find_lock_delalloc_range

So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way. So this is obviously a
candidate for some testing. This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages. Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to. With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 6174d3cb 01-Oct-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: remove unused max_key arg from btrfs_search_forward

It is not used for anything.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 0a4e5586 24-Sep-2013 Ross Kirk <ross.kirk@gmail.com>

btrfs: remove unused parameter from btrfs_header_fsid

Remove unused parameter, 'eb'. Unused since introduction in
5f39d397dfbe140a14edecd4e73c34ce23c4f9ee

Updated to be rebased against current upstream and correct diff supplied this time!

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 06ea65a3 19-Sep-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add a sanity test for btrfs_split_item

While looking at somebodys corruption I became completely convinced that
btrfs_split_item was broken, so I wrote this test to verify that it was working
as it was supposed to. Thankfully it appears to be working as intended, so just
add this test to make sure nobody breaks it in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# dd3cc16b 16-Sep-2013 Ross Kirk <ross.kirk@gmail.com>

btrfs: drop unused parameter from btrfs_item_nr

Remove unused eb parameter from btrfs_item_nr

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 363e4d35 17-Sep-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: remove space_info->reservation_progress

This isn't used for anything anymore, just remove it.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# c4fbb430 17-Sep-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix worst case calculator for space usage

Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right. If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 83d4cfd4 30-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fixup error handling in btrfs_reloc_cow

If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent. So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# d8f98039 24-Aug-2013 Filipe David Borba Manana <fdmanana@gmail.com>

Btrfs: fix memory leak of uuid_root in free_fs_info

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 2e17c7c6 26-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add support for asserts

One of the complaints we get a lot is how many BUG_ON()'s we have. So to help
with this I'm introducing a kconfig option to enable/disable a new ASSERT()
mechanism much like what XFS does. This will allow us developers to still get
our nice panics but allow users/distros to compile them out. With this we can
go through and convert any BUG_ON()'s that we have to catch actual programming
mistakes to the new ASSERT() and then fix everybody else to return errors. This
will also allow developers to leave sanity checks in their new code to make sure
we don't trip over problems while testing stuff and vetting new features.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# b308bc2f 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long

Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
casts it to a pointer, while all callers cast it to unsigned long again.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# fba6aa75 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make btrfs_header_fsid() return unsigned long

Internally, btrfs_header_fsid() calculates an unsigned long, but casts
it to a pointer, while all callers cast it to unsigned long again.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 231e88f4 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long

Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long,
but casts it to a pointer, while all callers cast it to unsigned long
again.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 1473b24e 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make btrfs_device_fsid() return unsigned long

All callers of btrfs_device_fsid() cast its return type to unsigned long.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 410ba3a2 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make btrfs_device_uuid() return unsigned long

All callers of btrfs_device_uuid() cast its return type to unsigned long.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 6e71c47a 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org>

Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant

The internal btrfs device id is a u64, hence make the constant
BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need
a cast to print it.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# f420ee1e 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add mount option to force UUID tree checking

This should never be needed, but since all functions are there
to check and rebuild the UUID tree, a mount option is added that
allows to force this check and rebuild procedure.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 70f80175 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: check UUID tree during mount if required

If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache is doing it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, delete those
entries that are not valid anymore (i.e., the subvol does not
exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add
the UUID tree entries for the subvolume (if they are not
already there).

This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 26432799 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: introduce uuid-tree-gen field

In order to be able to detect the case that a filesystem is mounted
with an old kernel, add a uuid-tree-gen field like the free space
cache is doing it. It is part of the super block and written with
each commit. Old kernels do not know this field and don't update it.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 803b2f54 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: fill UUID tree initially

When the UUID tree is initially created, a task is spawned that
walks through the root tree. For each found subvolume root_item,
the uuid and received_uuid entries in the UUID tree are added.
This is such a quick operation so that in case somebody wants
to unmount the filesystem while the task is still running, the
unmount is delayed until the UUID tree building task is finished.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# dd5f9615 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: maintain subvolume items in the UUID tree

When a new subvolume or snapshot is created, a new UUID item is added
to the UUID tree. Such items are removed when the subvolume is deleted.
The ioctl to set the received subvolume UUID is also touched and will
now also add this received UUID into the UUID tree together with the
ID of the subvolume. The latter is also done when read-only snapshots
are created which inherit all the send/receive information from the
parent subvolume.

User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and
read in the UUID tree.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# f7a81ea4 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: create UUID tree if required

This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 07b30a49 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: introduce a tree for items that map UUIDs to something

Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.

It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.

Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.

This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.

Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:

The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.

To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 171170c1 14-Aug-2013 Sergei Trofimovich <slyfox@gentoo.org>

btrfs: mark some local function as 'static'

Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 35a3621b 14-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: get rid of sparse warnings

make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.

All these changes should be obviously safe.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# ba5e8f2e 16-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix send issues related to inode number reuse

If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only. The way
we check for differences are if the leaves aren't the same and if the keys are
not the same within the leaves. So if neither leaf is the same (ie the leaf has
been cow'ed from the parent snapshot) we walk each item in the send root and
check it against the parent root. If the items match exactly then we don't do
anything. This doesn't quite work for inode refs, since they will just have the
name and the parent objectid. If you move the file from a directory and then
remove that directory and re-create a directory with the same inode number as
the old directory and then move that file back into that directory we will
assume that nothing changed and you will get errors when you try to receive.

In order to fix this we need to do extra checking to see if the inode ref really
is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match. Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing. The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume. If they match then they are the same directory
and we are good to go. If they don't we have to add them to the changed refs
list.

This means we have to track the generation of the ref we're trying to lookup
when we iterate all the refs for a particular inode. So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.

There was also the issue of using a ulist to keep track of the directories we
needed to check. Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value. So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.

Before we would fail all of the tests in the far-progs that related to moving
directories (test group 32). With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 00361589 14-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: avoid starting a transaction in the write path

I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9ffba8cd 14-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix heavy delalloc related deadlock

I added a patch where we started taking the ordered operations mutex when we
waited on ordered extents. We need this because we splice the list and process
it, so if a flusher came in during this scenario it would think the list was
empty and we'd usually get an early ENOSPC. The problem with this is that this
lock is used in transaction committing. So we end up with something like this

Transaction commit
-> wait on writers

Delalloc flusher
-> run_ordered_operations (holds mutex)
->wait for filemap-flush to do its thing

flush task
-> cow_file_range
->wait on btrfs_join_transaction because we're commiting

some other task
-> commit_transaction because we notice trans->transaction->flush is set
-> run_ordered_operations (hang on mutex)

We need to disentangle the ordered operations flushing from the delalloc
flushing, since they are separate things. This solves the deadlock issue I was
seeing. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 8b87dc17 01-Aug-2013 David Sterba <dsterba@suse.cz>

btrfs: add mount option to set commit interval

I'ts hardcoded to 30 seconds which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 36cce922 05-Aug-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: handle errors when doing slow caching

Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
forever if a drive failed. This is because we just bail out if there is an
error while trying to cache a block group, we don't update anybody who may be
waiting. So this introduces a new enum for the cache state in case of error and
makes everybody bail out if we have an error. Alex tested and verified this
patch fixed his problem. This fixes bz 59431. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# facc8a22 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't cache the csum value into the extent state tree

Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 3cae210f 15-Jul-2013 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert

Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.

Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# ee3441b4 09-Jul-2013 Jeff Mahoney <jeffm@suse.com>

btrfs: fall back to global reservation when removing subvolumes

I recently did some ENOSPC testing that involved filling the disk
while create and removing snapshots in a loop. During the test cycle,
I ran into an ENOSPC when trying to remove a snapshot, leaving the fs
stuck in ENOSPC even after a umount/mount cycle.

This patch allow subvolume removal to fall back onto the global
block reservation in order to succeed when it would have failed
otherwise.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 7ee9e440 21-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: check if we can nocow if we don't have data space

We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b150a4f1 19-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: use a percpu to keep track of possibly pinned bytes

There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation. This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space. So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space. So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free. We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction. With this patch we now pass xfstests 226. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 1be41b78 12-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix transaction throttling for delayed refs

Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram. This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs. What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run. To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve. With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster. Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 8c2a1a30 06-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: exclude logged extents before replying when we are mixed

With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them. However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log. To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents. With this patch we now pass xfstests generic/311 with mixed
block groups turned on. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b382a324 28-May-2013 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: fix qgroup rescan resume on mount

When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.

First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.

Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable

Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]

qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.

We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.

As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d52be818 29-May-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: simplify unlink reservations

Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files. The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try. So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation. If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve. With this patch Dave's reproducer now works and I can rm all the files
on the file system. Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4a9d8bde 16-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make the state of the transaction more readable

We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.

This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.

In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.

Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 199c2a9c 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: introduce per-subvolume ordered extent list

The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# eb73c1b7 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: introduce per-subvolume delalloc inode list

When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b0feb9d9 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: introduce grab/put functions for the root of the fs/file tree

The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# cb517eab 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cleanup the similar code of the fs root read

There are several functions whose code is similar, such as
btrfs_find_last_root()
btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# babbf170 14-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make the snap/subv deletion end more early when the fs is R/O

The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 1c89cdd1 11-May-2013 Andreas Philipp <philipp.andreas@gmail.com>

Minor format cleanup.

Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 57254b6e 06-May-2013 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: add ioctl to wait for qgroup rescan completion

btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 1e8f9158 06-May-2013 Wang Shilong <wangsl-fnst@cn.fujitsu.com>

Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist

When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.

By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below

fsstress -p 4 -n 10000 -d $dir

With this patch:

real 0m50.153s
user 0m0.081s
sys 0m6.294s

real 0m51.113s
user 0m0.092s
sys 0m6.220s

real 0m52.610s
user 0m0.096s
sys 0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:

real 0m54.825s
user 0m0.061s
sys 0m10.665s

real 1m6.401s
user 0m0.089s
sys 0m11.218s

real 1m13.768s
user 0m0.087s
sys 0m10.665s avg 10.849

we can see the sys time reduce ~43%.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b1c79e09 09-May-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: handle running extent ops with skinny metadata

Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata. We also lose
the level of the metadata we are working on. So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 60b62978 30-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: annotate quota tree for lockdep

Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.

There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 1104a885 06-Mar-2013 David Sterba <dsterba@suse.cz>

btrfs: enhance superblock checks

The superblock checksum is not verified upon mount. <awkward silence>

Add that check and also reorder existing checks to a more logical
order.

Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.

First transaction commit calculates correct superblock checksum and
saves it to disk.

Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# b6919a58 29-Apr-2013 David Sterba <dsterba@suse.cz>

btrfs: fix misleading variable name for flags

The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 48a3b636 25-Apr-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 2f232036 25-Apr-2013 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: rescan for qgroups

If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.

A filesystem under rescan can still be umounted. The rescan continues on the
next mount. Status information is provided with a separate ioctl while a
rescan operation is in progress.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# fc36ed7e 24-Apr-2013 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: separate sequence numbers for delayed ref tracking and tree mod log

Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.

However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.

To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.

The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.

Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 5fbf83c1 19-Apr-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: delete unused parameter to btrfs_read_root_item()

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4b90c680 15-Apr-2013 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: remove unused argument of btrfs_extend_item()

Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# afe5fea7 15-Apr-2013 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: cleanup of function where fixup_low_keys() is called

If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# ceda0864 11-Apr-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use a lock to protect incompat/compat flag of the super block

The following case will make the incompat/compat flag of the super block
be recovered.
Task1 |Task2
flags = btrfs_super_incompat_flags(); |
|flags = btrfs_super_incompat_flags();
flags |= new_flag1; |
|flags |= new_flag2;
btrfs_set_super_incompat_flags(flags); |
|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.

In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# f2f6ed3d 07-Apr-2013 Wang Shilong <wangsl-fnst@cn.fujitsu.com>

Btrfs: introduce a mutex lock for btrfs quota operations

The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
->add_qgroup_info_item()
->add_qgroup_rb()

For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().

What's worse, when we want to add a qgroup relation:
->add_qgroup_relation_item()
->add_qgroup_relations()

We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.

To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 09a2a8f9 05-Apr-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix bad extent logging

A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# c2cf52eb 19-Mar-2013 Simon Kirby <sim@hostway.ca>

Btrfs: Include the device in most error printk()s

With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.

This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.

This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.

Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.

Signed-off-by: Simon Kirby <sim@hostway.ca>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 3173a18f 07-Mar-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: add a incompatible format change for smaller metadata extent refs

We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 08748810 12-Mar-2013 David Sterba <dsterba@suse.cz>

btrfs: clean up transaction abort messages

The transaction abort stacktrace is printed only once per module
lifetime, but we'd like to see it each time it happens per mounted
filesystem. Introduce a fs_state flag that records it.

Tweak the messages around abort:
* add error number to the first abort
* print the exact negative errno from btrfs_decode_error
* clean up btrfs_decode_error and callers
* no dots at the end of the messages

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d5c12070 28-Feb-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix wrong reserved space in qgroup during snap/subv creation

There are two problems in the space reservation of the snapshot/
subvolume creation.
- don't reserve the space for the root item insertion
- the space which is reserved in the qgroup is different with
the free space reservation. we need reserve free space for
7 items, but in qgroup reservation, we need reserve space only
for 3 items.

So we implement new metadata reservation functions for the
snapshot/subvolume creation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# dc81cdc5 20-Feb-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix remount vs autodefrag

If we remount the fs to close the auto defragment or make the fs R/O,
we should stop the auto defragment.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# cdb4c574 19-Feb-2013 Zach Brown <zab@redhat.com>

btrfs: define BTRFS_MAGIC as a u64 value

super.magic is an le64 but it's treated as an unterminated string when
compared against BTRFS_MAGIC which is defined as a string. Instead
define BTRFS_MAGIC as a normal hex value and use endian helpers to
compare it to the super's magic.

I tested this by mounting an fs made before the change and made sure
that it didn't introduce sparse errors. This matches a similar cleanup
that is pending in btrfs-progs. David Sterba pointed out that we should
fix the kernel side as well :).

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 569e0f35 13-Feb-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: place ordered operations on a per transaction list

Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 210549eb 09-Feb-2013 David Sterba <dsterba@suse.cz>

btrfs: add cancellation points to defrag

The defrag operation can take very long, we want to have a way how to
cancel it. The code checks for a pending signal at safe points in the
defrag loops and returns EAGAIN. This means a user can press ^C after
running 'btrfs fi defrag', woks for both defrag modes, files and root.

Returning from the command was instant in my light tests, but may take
longer depending on the aging factor of the filesystem.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 5d80366e 07-Feb-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: steal from global reserve if we are cleaning up orphans

Sometimes xfstest 83 will fail to remount the scratch device because we've
gotten ourselves so full that we cannot cleanup the orphan items. In this
case check to see if we're doing the orphan cleanup and if we are allow us
to steal our reservation from the global block rsv. With this patch I've
not been able to reproduce the failed mount problem. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# de78b51a 31-Jan-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: remove cache only arguments from defrag path

The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell. Chris says they're dead code, so remove them.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# aa43a17c 30-Jan-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: handle null fs_info in btrfs_panic()

At least backref_tree_panic() can apparently pass
in a null fs_info, so handle that in __btrfs_panic
to get the message out on the console.

The btrfs_panic macro also uses fs_info, but that's
largely pointless; it's testing to see if
BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set.
But if it *were* set, __btrfs_panic() would have,
well, paniced and we wouldn't be here, testing it!
So just BUG() at this point.

And since we only use fs_info once now, just use it
directly.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 87533c47 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use bit operation for ->fs_state

There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.

Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.

Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# de98ced9 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits

There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 963d678b 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use percpu counter for fs_info->delalloc_bytes

fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.

This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e2d84521 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use percpu counter for dirty metadata count

->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.

This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# c018daec 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: protect fs_info->alloc_start

fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)

For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
Task0 Task1
load high 32 bits
set high 32 bits
set low 32 bits
load low 32 bits

Task1 will get 0.

This patch fixes this problem by using two locks to protect it
fs_info->chunk_mutex
sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 8c6a3ee6 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: add a comment for fs_info->max_inline

Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 55e301fd 28-Jan-2013 Filipe Brandenburger <filbranden@google.com>

Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h

The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e6ec716f 16-Jan-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make raid attr array more readable

The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# a1897fdd 27-Dec-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: record first logical byte in memory

This'd save us a rbtree search which may become expensive in large filesystem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# dcfac415 27-Dec-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay

Argument 'trans' is not used any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 2ab28f32 12-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: wait on ordered extents at the last possible moment

Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4ae10b3a 31-Jan-2013 Chris Mason <chris.mason@fusionio.com>

Btrfs: Add a stripe cache to raid56

The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:

* They are read in during a read/modify/write cycle

* They are written during a read/modify/write cycle

* They are involved in a parity rebuild

Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.

This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.

The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.

Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct

Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 53b381b3 29-Jan-2013 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: RAID5 and RAID6

This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 64a16701 15-Jul-2009 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: add rw argument to merge_bio_hook()

We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9c52057c 17-Dec-2012 Chris Mason <chris.mason@fusionio.com>

Btrfs: fix hash overflow handling

The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland. For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.

Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped. The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.

This fixes a few problem spots. First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.

btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite. Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.

Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Pascal Junod <pascal@junod.info>


# 31e50229 21-Nov-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: put raid properties into global table

Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# ad914559 15-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: don't memset new tokens

Our token logic depends on token->kaddr being set, and if it is not it sets
everything properly as needed. So instead of memsetting just set
token->kaddr to NULL. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 70c8a91c 11-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: log changed inodes based on the extent map tree

We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# d6393786 12-Dec-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: add path->really_keep_locks

You'd think path->keep_locks would keep all the locks wouldn't you? You'd
be wrong. It only keeps them if the slot is pointing to the last item in
the node. This is for use with btrfs_next_leaf, which needs this sort of
thing. But the horrible horrible things I'm going to do to the tree log
means I really need everything held from root to leaf so I can add and
delete items in the same search. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5f3ab90a 07-Dec-2012 Anand Jain <anand.jain@oracle.com>

Btrfs: rename root_times_lock to root_item_lock

Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 26176e7c 26-Nov-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: restructure btrfs_run_defrag_inodes()

This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9247f317 26-Nov-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: use slabs for auto defrag allocation

The auto defrag allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.

And besides that, it can do check for leaked objects when the module is removed.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 72d7aefc 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace

This change of the define is effective in all modes, it
is required and used only in the case when a device replace
procedure is running. The reason is that during an active
device replace procedure, the target device of the copy
operation is a mirror for the filesystem data as well that
can be used to read data in order to repair read errors on
other disks.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 29a8d9a0 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()

Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.

The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# e93c89c1 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add new sources for device replace code

This adds a new file to the sources together with the header file
and the changes to ioctl.h and ctree.h that are required by the
new C source file. Additionally, 4 new functions are added to
volume.c that deal with device creation and destruction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# ff023aac 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add code to scrub to copy read data to another disk

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit adds code to scrub to allow the scrub code to copy read
data to another disk.
One goal is to be able to perform as fast as possible. Therefore the
write requests are collected until huge bios are built, and the
write process is decoupled from the read process with some kind of
flow control, of course, in order to limit the allocated memory.
The best performance on spinning disks could by reached when the
head movements are avoided as much as possible. Therefore a single
worker is used to interface the read process with the write process.
The regular scrub operation works as fast as before, it is not
negatively influenced and actually it is more or less unchanged.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 63a212ab 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: disallow some operations on the device replace target device

This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5ac00add 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: disallow mutually exclusive admin operations from user mode

Btrfs admin operations that are manually started from user mode
and that cannot be executed at the same time return -EINPROGRESS.
A common way to enter and leave this locked section is introduced
since it used to be specific to the balance operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# a2bff640 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: introduce a btrfs_dev_replace_item type

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# e922e087 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: enhance btrfs structures for device replace support

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# aa1b8cd4 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: pass fs_info instead of root

A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 315a9850 01-Nov-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix wrong file extent length

There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# d1423248 31-Oct-2012 Masanari Iida <standby24x7@gmail.com>

Btrfs: Fix typo in fs/btrfs

Correct spelling typo in btrfs.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 8ccf6f19 25-Oct-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make delalloc inodes be flushed by multi-task

This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 08e007d2 16-Oct-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: improve the noflush reservation

In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.

We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 926ccfef 31-Oct-2012 Masanari Iida <standby24x7@gmail.com>

Btrfs: Fix printk and variable name

Correct spelling typo in btrfs.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# be6aef60 22-Oct-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: Use btrfs_update_inode_fallback when creating a snapshot

On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot. Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 5b6602e7 23-Oct-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: determine level of old roots

In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 5af3e8cc 01-Aug-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: make filesystem read-only when submitting barrier fails

So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.

In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.

The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# f186373f 08-Aug-2012 Mark Fasheh <mfasheh@suse.de>

btrfs: extended inode refs

This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.

Inode creation does not need changing - extended refs are only added after
the ref array is full.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>


# 005d6427 18-Sep-2012 David Sterba <dsterba@suse.cz>

btrfs: move transaction aborts to the point of failure

Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 2e90cf85 14-Sep-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: cleanup fs_info->hashers

fs_info->hashers is now an obsolete one.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>


# ea658bad 11-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: delay block group item insertion

So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 0647d6bd 07-Sep-2012 liubo <liubo2009@cn.fujitsu.com>

Btrfs: cleanup for unused ref cache stuff

As ref cache has been removed from btrfs, there is no user on
its lock and its check.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>


# 2ecb7923 06-Sep-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix unprotected ->log_batch

We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 66d8f3dd 06-Sep-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: add a new "type" field into the block reservation structure

Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 7014cdb4 30-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: btrfs_drop_extent_cache should never fail

I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent. This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given. This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop. This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 2aaa6655 29-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: add hole punching

This patch adds hole punching via fallocate. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 2671485d 28-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: remove unused hint byte argument for btrfs_drop_extents

I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument. I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# ca7e70f5 27-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: do not needlessly restart the transaction for enospc

We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 5dc562c5 17-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: turbo charge fsync

At least for the vm workload. Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s

So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 527a1361 05-Sep-2012 Wang Sheng-Hui <shhuiw@gmail.com>

btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID

It should be storing, not sotring.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 1fa11e26 06-Aug-2012 Arne Jansen <sensille@gmx.net>

Btrfs: fix deadlock in wait_for_more_refs

Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# c329861d 03-Aug-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: don't allocate a seperate csums array for direct reads

We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range. But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums. This allows us to avoid an extra memory allocation
for direct reads which could incur latency. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 533574c6 30-Jul-2012 Joe Perches <joe@perches.com>

btrfs: use printk_get_level and printk_skip_level, add __printf, fix fallout

Use the generic printk_get_level() to search a message for a kern_level.

Add __printf to verify format and arguments. Fix a few messages that
had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks
to shrink the object size a bit when not using printk.

[akpm@linux-foundation.org: whitespace tweak]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7069830a 05-Jun-2012 Alexander Block <ablock84@googlemail.com>

Btrfs: add btrfs_compare_trees function

This function is used to find the differences between
two trees. The tree compare skips whole subtrees if it
detects shared tree blocks and thus is pretty fast.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>


# 8ea05e3a 25-Jul-2012 Alexander Block <ablock84@googlemail.com>

Btrfs: introduce subvol uuids and times

This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>


# 2b0ce2c2 24-Jul-2012 Mitch Harder <mitch.harder@sabayonlinux.org>

Btrfs: Check INCOMPAT flags on remount and add helper function

In support of the recently added capability to remount with lzo
compression, provide a helper function to check the compression
INCOMPAT flags when remounting with lzo compression, and set
the flags if necessary.

Also, implement the new helper function when defragmenting with
explicit lzo compression and when setting the default subvolume.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# e6793769 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: add helper for tree enumeration

Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 18077bb4 09-Jul-2012 Li Zefan <lizefan@huawei.com>

Btrfs: rewrite BTRFS_SETGET_FUNCS

BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
btrfs_foo() functions, which read and write specific fields in the
extent buffer.

The total number of set/get functions is ~200, but in fact we only
need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.

It results in redunction of ~37K bytes.

text data bss dec hex filename
629661 12489 216 642366 9cd3e fs/btrfs/btrfs.o.orig
592637 12489 216 605342 93c9e fs/btrfs/btrfs.o

Signed-off-by: Li Zefan <lizefan@huawei.com>


# b4d7c3c9 09-Jul-2012 Li Zefan <lizefan@huawei.com>

Btrfs: kill free_space pointer from inode structure

Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

Signed-off-by: Li Zefan <lizefan@huawei.com>


# bcef60f2 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: quota tree support and startup

Init the quota tree along with the others on open_ctree
and close_ctree. Add the quota tree to the list of well
known trees in btrfs_read_fs_root_no_name.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# bed92eae 28-Jun-2012 Arne Jansen <sensille@gmx.net>

Btrfs: qgroup implementation and prototypes

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 416ac51d 12-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: qgroup state and initialization

Add state to fs_info.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 2f38b3e1 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: add helper for tree enumeration

Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 630dc772 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: qgroup on-disk format

Not all features are in use by the current version
and thus may change in the future.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 097b8a7c 21-Jun-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: join tree mod log code with the code holding back delayed refs

We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.

Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 1c8f52a5 19-Jun-2012 Alexander Block <ablock84@googlemail.com>

Btrfs: introduce btrfs_next_old_item

We introduce btrfs_next_old_item that uses btrfs_next_old_leaf instead
of btrfs_next_leaf.

btrfs_next_item is also changed to simply call btrfs_next_old_item with
time_seq being 0.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 3d7806ec 11-Jun-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: add btrfs_next_old_leaf

To make sense of the tree mod log, the backref walker not only needs
btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was
calling btrfs_search_slot. This obviously didn't give the correct result.

This commit adds btrfs_next_old_leaf, a drop-in replacement for
btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
with this time_seq parameter.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# e41f941a 26-Mar-2012 Josef Bacik <josef@redhat.com>

Btrfs: move over to use ->update_time

Btrfs had been doing it's own file_update_time so we could catch ENOSPC
properly, so just update our btrfs_update_time to work with the new stuff and
then we'll be fancy later. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 95a06077 29-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: use delayed ref sequence numbers for all fs-tree updates

The sequence number for delayed refs is needed to postpone certain delayed
refs for a very short period while walking backrefs. Before the tree
modification log, we thought we'd only have to hold back those references
that don't have a counter operation.

While now we've the tree mod log, we're rewinding fs tree blocks to a
defined consistent state. We cannot know in advance for which tree block
we'll be doing rewind operations later. Therefore, we must postpone all the
delayed refs for fs-tree blocks, even those having a counter operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 3d136a11 03-Feb-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: set ioprio of scrub readahead to idle

Reduce ioprio class of scrub readahead threads to idle priority.
This setting is fixed. This priority has shown the best performance
during all measurements.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 733f4fbb 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: read device stats on mount, write modified ones during commit

The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 8a35d95f 23-May-2012 Josef Bacik <josef@redhat.com>

Btrfs: fix how we deal with the orphan block rsv

Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Btrfs: fix how we deal with the orphan block rsv

Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 5d9e75c4 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: add btrfs_search_old_slot

The tree modification log together with the current state of the tree gives
a consistent, old version of the tree. btrfs_search_old_slot is used to
search through this old version and return old (dummy!) extent buffers.
Naturally, this function cannot do any tree modifications.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# bd989ba3 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: add tree modification log functions

The tree mod log will log modifications made fs-tree nodes. Most
modifications are done by autobalance of the tree. Such changes are recorded
as long as a block entry exists. When released, the log is cleaned.

With the tree modification log, it's possible to reconstruct a consistent
old state of the tree. This is required to do backref walking on a busy
file system.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# f29021b2 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: add tree mod log to fs_info

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 64947ec0 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: move struct seq_list to ctree.h

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 5581a51a 16-May-2012 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: don't set for_cow parameter for tree block functions

Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.

Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 25cd999e 30-Mar-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: fix that check_int_data mount option was ignored

The bitfield member mount_opt was too small by one bit to hold the mount
option that enabled to include data extents in the integrity checker.
Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR
option was added (git rebase silently merges so that the increase of the
size of the bitfield member is lost), the bit limit was removed entirely.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 6ed3cf2c 13-Apr-2012 Al Viro <viro@zeniv.linux.org.uk>

btrfs: btrfs_root_readonly() broken on big-endian

->root_flags is __le64 and all accesses to it go through the helpers
that do proper conversions. Except for btrfs_root_readonly(), which
checks bit 0 as in host-endian...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 94598ba8 27-Mar-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: introduce common define for max number of mirrors

Readahead already has a define for the max number of mirrors. Scrub
needs such a define now, the rest of the code will need something
like this soon. Therefore the define was added to ctree.h and removed
from the readahead code.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0c460c0d 27-Mar-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: move alloc_profile_is_valid() to volumes.c

Header file is not a good place to define functions. This also moves a
call to alloc_profile_is_valid() down the stack and removes a redundant
check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it
into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e8920a64 27-Mar-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: make profile_is_valid() check more strict

"0" is a valid value for an on-disk chunk profile, but it is not a valid
extended profile. (We have a separate bit for single chunks in extended
case)

Also rename it to alloc_profile_is_valid() for clarity.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 899c81ea 27-Mar-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add wrappers for working with alloc profiles

Add functions to abstract the conversion between chunk and extended
allocation profile formats and switch everybody to use them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cfed81a0 03-Mar-2012 Chris Mason <chris.mason@oracle.com>

Btrfs: add the ability to cache a pointer into the eb

This cuts down on the CPU time used by map_private_extent_buffer

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 727011e0 06-Aug-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: allow metadata blocks larger than the page size

A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling. This fixes the code to properly support
large metadata blocks again.

Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.

This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 81c9ad23 18-Jan-2012 Josef Bacik <josef@redhat.com>

Btrfs: remove search_start and search_end from find_free_extent and callers

We have been passing nothing but (u64)-1 to find_free_extent for search_end in
all of the callers, so it's completely useless, and we've always been passing 0
in as search_start, so just remove them as function arguments and move
search_start into find_free_extent. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 49b25e05 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: enhance transaction abort infrastructure

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 4da35113 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: add varargs to btrfs_error

btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.

This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 2c536799 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_drop_snapshot should return int

Commit cb1b69f4 (Btrfs: forced readonly when btrfs_drop_snapshot() fails)
made btrfs_drop_snapshot return void because there were no callers checking
the return value. That is the wrong order to handle error propogation since
the caller will have no idea that an error has occured and continue on
as if nothing went wrong.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 143bede5 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: return void in functions without error conditions

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# b45a9d8b 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_update_root error push-up

btrfs_update_root BUG's when it can't alloc a path, yet it can recover
from a search error. This patch returns -ENOMEM instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 8c342930 03-Oct-2011 Jeff Mahoney <jeffm@suse.com>

btrfs: Add btrfs_panic()

As part of the effort to eliminate BUG_ON as an error handling
technique, we need to determine which errors are actual logic errors,
which are on-disk corruption, and which are normal runtime errors
e.g. -ENOMEM.

Annotating these error cases is helpful to understand and report them.

This patch adds a btrfs_panic() routine that will either panic
or BUG depending on the new -ofatal_errors={panic,bug} mount option.
Since there are still so many BUG_ONs, it defaults to BUG for now but I
expect that to change once the error handling effort has made
significant progress.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# c08782da 26-Jan-2012 David Sterba <dsterba@suse.cz>

btrfs: fix structs where bitfields and spinlock/atomic share 8B word

On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has
disaterous effect.

https://lkml.org/lkml/2012/2/1/220

Signed-off-by: David Sterba <dsterba@suse.cz>


# a7e99c69 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: allow for canceling restriper

Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 837d5b6e 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: allow for pausing restriper

Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 9555c6c1 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add skip_balance mount option

Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it. The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0940ebf6 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: save balance parameters to disk

Introduce a new btree objectid for storing balance item. The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.

The key for the new item is as follows:

[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 70922617 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: do not reduce profile in do_chunk_alloc()

Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time. Instead check the
validity of the passed profile.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c9e9f97b 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add basic restriper infrastructure

Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a46d11a8 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit

Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.

This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 52ba6929 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: introduce masks for chunk type and profile

Chunk's type and profile are encoded in u64 flags field. Introduce
masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6fef8df1 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: get rid of *_alloc_profile fields

{data,metadata,system}_alloc_profile fields have been unused for a long
time now. Get rid of them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 815745cf 17-Nov-2011 Al Viro <viro@zeniv.linux.org.uk>

btrfs: let ->s_fs_info point to fs_info, not root...

the latter can be obtained from the former (by looking as ->tree_root)
just as cheaply as we currently are doing the other way round.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 66d7e7f0 12-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: mark delayed refs as for cow

Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# c7d22a3c 22-Nov-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

Btrfs: added helper btrfs_next_item()

btrfs_next_item() makes the btrfs path point to the next item, crossing leaf
boundaries if needed.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# 21adbd5c 09-Nov-2011 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: integrate integrity check module into btrfs

This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 22c44fe6 30-Nov-2011 Josef Bacik <josef@redhat.com>

Btrfs: deal with enospc from dirtying inodes properly

Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC. This needs to be fixed in a few areas

1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty. So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.

2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't
setting ->setattr for symlinks, even though we should have been. This catches
one of the cases where we were getting errors in mark_inode_dirty.

3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty. This lets us return errors properly for truncate
and chown/anything related to setattr.

4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one. The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.

With this patch xfstests 83 complains a handful of times instead of hundreds of
times. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# aa38a711 18-Nov-2011 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix deadlock on metadata reservation when evicting a inode

When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
start transaction
|
v
reserve meta-data space
|
v
flush delay allocation -> iput inode -> evict inode
^ |
| v
wait for delay allocation flush <- reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 291c7d2f 14-Nov-2011 Josef Bacik <josef@redhat.com>

Btrfs: wait on caching if we're loading the free space cache

We've been hitting panics when running xfstest 13 in a loop for long periods of
time. And actually this problem has always existed so we've been hitting these
things randomly for a while. Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go. Then we get another thread that comes
in and tries to allocate from that block group. Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish. The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.

The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# f1ebcc74 14-Nov-2011 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush

The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases to lead to tree corruptions:

1) multi-thread snapshots can commit serveral snapshots in a transaction,
and this may change the src root when processing the following pending
snapshots, which lead to the former snapshots corruptions;

2) the free inode cache was changing the roots when it root the cache,
which lead to corruptions.

This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c06a0e12 04-Nov-2011 Josef Bacik <josef@redhat.com>

Btrfs: fix delayed insertion reservation

We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.

The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.

With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6d668dda 03-Nov-2011 Josef Bacik <josef@redhat.com>

Btrfs: make a delayed_block_rsv for the delayed item insertion

I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff. It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff. So instead create a seperate block rsv for the delayed
insertion stuff. This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# af31f5e5 03-Nov-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: add a log of past tree roots

This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.

It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6c41761f 13-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: separate superblock items out of fs_info

fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>


# e688b725 31-Oct-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: fix extent pinning bugs in the tree log

The tree log had two important bugs that could cause corruptions after a
crash. Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.

This allowed a future metadata write to overwrite the tree log data. It
is fixed by adding a new variant of freeing reserved extents that always
pins them. Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.

During tree log replay, we do a pass through the tree log and pin all
the extents we find. This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay. The problem
is the free space cache isn't honoring these pinned extents. So the
allocator can end up handing them out, leading to all kinds of problems
during replay.

The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 36ba022a 17-Oct-2011 Josef Bacik <josef@redhat.com>

Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions

Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it. However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction. So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 5b0e95bf 06-Oct-2011 Josef Bacik <josef@redhat.com>

Btrfs: inline checksums into the disk free space cache

Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back. The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this? This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen. So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache. The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine. So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on. This survives
xfstests and stress.sh.

The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 2bf64758 26-Sep-2011 Josef Bacik <josef@redhat.com>

Btrfs: allow us to overcommit our enospc reservations

One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test

go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 3b16a4e3 21-Sep-2011 Josef Bacik <josef@redhat.com>

Btrfs: use the inode's mapping mask for allocating pages

Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,

Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>


# 4a92b1b8 29-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: stop passing a trans handle all around the reservation code

The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 482e6dc5 19-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check

If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount. This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed. But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes. Now xfstests 224 runs fine without all those
complaints. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 07127184 19-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: reduce the amount of space needed for truncates

With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again. This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space. Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 5e962c78 08-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill btrfs_truncate_reserve_metadata

Since we've optimized the truncate path, we no longer require this function.

Signed-off-by: Josef Bacik <josef@redhat.com>


# dabdb640 07-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill unused parts of block_rsv

The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 37be25bc 05-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill the durable block rsv stuff

This is confusing code and isn't used by anything anymore, so delete it.

Signed-off-by: Josef Bacik <josef@redhat.com>


# dba68306 04-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill the orphan space calculation for snapshots

This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# fb25e914 26-Jul-2011 Josef Bacik <josef@redhat.com>

Btrfs: use bytes_may_use for all ENOSPC reservations

We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.

This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 7414a03f 23-May-2011 Arne Jansen <sensille@gmx.net>

btrfs: initial readahead code and prototypes

This is the implementation for the generic read ahead framework.

To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).

The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.

Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock

Changes v3:
- protect root->node by rcu

Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 90519d66 23-May-2011 Arne Jansen <sensille@gmx.net>

btrfs: state information for readahead

Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>


# c97c2916 03-Aug-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: use plain page_address() in header fields setget functions

We've stopped using highmem for extent buffers.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cb1b69f4 09-Aug-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com>

Btrfs: forced readonly when btrfs_drop_snapshot() fails

The filesystem turns readonly instead of returning the error to the
caller when detected error in btrfs_drop_snapshot().
and, because the caller doesn't check the error, the function type is
changed to 'void'.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9b89d95a 13-Jul-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: make acl functions really no-op if acl is not enabled

So there's no overhead for something we don't use.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bf5f32ec 14-Jul-2011 Mark Fasheh <mfasheh@suse.com>

btrfs: make btrfs_set_root_node void

This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void. All callers ignore the return code now anyway. I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b6973aa6 19-Jul-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: fix readahead in file defrag

We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bd681513 16-Jul-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: switch the btrfs tree locks to reader/writer

The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9e0baf60 15-Jul-2011 Josef Bacik <josef@redhat.com>

Btrfs: fix enospc problems with delalloc

So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea. Consider this where we have 1
outstanding extent and 1 reserved extent

Reserver Releaser
atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
free reserved space for 1 extent

Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bab39bf9 30-Jun-2011 Josef Bacik <josef@redhat.com>

Btrfs: use a worker thread to do caching

A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,

Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>


# 4e34e719 23-Jul-2011 Christoph Hellwig <hch@lst.de>

fs: take the ACL checks to common code

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 02c24a82 16-Jul-2011 Josef Bacik <josef@redhat.com>

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b2675157 18-Jul-2011 Josef Bacik <josef@redhat.com>

Btrfs: implement our own ->llseek

In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0ee5dc67 07-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

btrfs: kill magical embedded struct superblock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7e40145e 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->check_acl()

not used in the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fdb5effd 07-Jun-2011 Josef Bacik <josef@redhat.com>

Btrfs: serialize flushers in reserve_metadata_bytes

We keep having problems with early enospc, and that's because our method of
making space is inherently racy. The problem is we can have one guy trying to
make space for himself, and in the meantime people come in and steal his
reservation. In order to stop this we make a waitqueue and put anybody who
comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
make more space. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# b5009945 07-Jun-2011 Josef Bacik <josef@redhat.com>

Btrfs: do transaction space reservation before joining the transaction

We have to do weird things when handling enospc in the transaction joining code.
Because we've already joined the transaction we cannot commit the transaction
within the reservation code since it will deadlock, so we have to return EAGAIN
and then make sure we don't retry too many times. Instead of doing this, just
do the reservation the normal way before we join the transaction, that way we
can do whatever we want to try and reclaim space, and then if it fails we know
for sure we are out of space and we can return ENOSPC. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0942caa3 28-Jun-2011 David Sterba <dsterba@suse.cz>

btrfs: add missing options displayed in mount output

There are three missed mount options settable by user which are not
currently displayed in mount output.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 46e4edbf 23-Jun-2011 Jesper Juhl <jj@chaosbits.net>

Remove unneeded version.h includes from fs/

It was pointed out by 'make versioncheck' that some includes of
linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and
fs/omfs/file.c).

This patch removes them.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9fe6a50f 16-Jun-2011 Maarten Lankhorst <m.b.lankhorst@gmail.com>

btrfs: Remove unused sysfs code

Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.

Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7585717f 13-Jun-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: fix relocation races

The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.

This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7841cb28 31-May-2011 David Sterba <dsterba@suse.cz>

btrfs: add helper for fs_info->closing

wrap checking of filesystem 'closing' flag and fix a few missing memory
barriers.

Signed-off-by: David Sterba <dsterba@suse.cz>


# 4b9465cb 03-Jun-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: add mount -o inode_cache

This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# aa385729 27-May-2011 Christoph Hellwig <hch@infradead.org>

fs: pass exact type of data dirties to ->dirty_inode

Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.

This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.

Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4cb5300b 24-May-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: add mount -o auto_defrag

This will detect small random writes into files and
queue the up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0956c798 17-May-2011 Andi Kleen <ak@linux.intel.com>

BTRFS: Remove unused node_lock

240f62c8756 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d82a6f1d 11-May-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill BTRFS_I(inode)->block_group

Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull. In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# fcb80c2a 03-May-2011 Josef Bacik <josef@redhat.com>

Btrfs: fix how we do space reservation for truncate

The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up. This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen. This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen. So in order
to fix this we need to split all of the reservation stuf up. So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# a4abeea4 11-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill trans_mutex

We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 16cdcec7 22-Apr-2011 Miao Xie <miaox@cn.fujitsu.com>

btrfs: implement delayed inode items operation

Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030

After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4ea02885 12-May-2011 David Sterba <dsterba@suse.cz>

btrfs: use unsigned type for single bit bitfield

Signed-off-by: David Sterba <dsterba@suse.cz>


# 8628764e 23-Mar-2011 Arne Jansen <sensille@gmx.net>

btrfs: add readonly flag

setting the readonly flag prevents writes in case an error is detected

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 475f6387 11-Mar-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

btrfs: new ioctls for scrub

adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# a2de733c 08-Mar-2011 Arne Jansen <sensille@gmx.net>

btrfs: scrub

This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# f2a97a9d 04-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove all unused functions

Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>


# 621496f4 03-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove unused function prototypes

function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>


# b3b4aa74 20-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: drop unused parameter from btrfs_release_path

parameter tree root it's not used since commit
5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>


# 306e16ce 19-Apr-2011 David Sterba <dsterba@suse.cz>

btrfs: rename variables clashing with global function names

reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>


# e9c54999 27-Apr-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi>

Revert wrong fixes for common misspellings

These changes were incorrectly fixed by codespell. They were now
manually corrected.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>


# 82d5902d 19-Apr-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Support reading/writing on disk free ino cache

This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 581bb050 19-Apr-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Cache free inode numbers in memory

Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 34d52cb6 28-Mar-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Make free space cache code generic

So we can re-use the code to cache free inode numbers.

The change is quite straightforward. Two new structures are introduced.

- struct btrfs_free_space_ctl

We move those variables that are used for caching free space from
struct btrfs_block_group_cache to this new struct.

- struct btrfs_free_space_op

We do block group specific work (e.g. calculation of extents threshold)
through functions registered in this struct.

And then we can remove references to struct btrfs_block_group_cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 6d74119f 11-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: avoid taking the chunk_mutex in do_chunk_alloc

Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all. So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>


# be1a12a0 06-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: deal with the case that we run out of space in the cache

Currently we don't handle running out of space in the cache, so to fix this we
keep track of how far in the cache we are. Then we only dirty the pages if we
successfully modify all of them, otherwise if we have an error or run out of
space we can just drop them and not worry about the vm writing them out.
Thanks,

Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Josef Bacik <josef@redhat.com>


# 08fe4db1 27-Mar-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Fix uninitialized root flags for subvolumes

root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.

To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.

Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c59021f8 06-Mar-2011 liubo <liubo2009@cn.fujitsu.com>

Btrfs: fix OOPS of empty filesystem after balance

btrfs will remove unused block groups after balance.
When a empty filesystem is balanced, the block group with tag "DATA" may be
dropped, and after umount and mount again, it will not find "DATA" space_info
and lead to OOPS.
So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS.

Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f7039b1d 24-Mar-2011 Li Dongyang <lidongyang@novell.com>

Btrfs: add btrfs_trim_fs() to handle FITRIM

We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5378e607 24-Mar-2011 Li Dongyang <lidongyang@novell.com>

Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes

Callers of btrfs_discard_extent() should check if we are mounted with -o discard,
as we want to make fitrim to work even the fs is not mounted with -o discard.
Also we should use REQ_DISCARD to map the free extent to get a full mapping,
last we only return errors if
1. the error is not a EOPNOTSUPP
2. no device supports discard

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b4d00d56 24-Mar-2011 Li Dongyang <lidongyang@novell.com>

Btrfs: make update_reserved_bytes() public

Make the function public as we should update the reserved extents calculations
after taking out an extent for trimming.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 75e7cb7f 22-Mar-2011 Liu Bo <liubo2009@cn.fujitsu.com>

Btrfs: Per file/directory controls for COW and compression

Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super. However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.

After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.

NOTE:
- The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).

v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be screwed by inheritance from parent directory.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1abe9b8a 24-Mar-2011 liubo <liubo2009@cn.fujitsu.com>

Btrfs: add initial tracepoint support for btrfs

Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
btrfs_sync_file
btrfs_sync_fs

These show sync arguments.

6) transaction:
btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4e69b598 21-Mar-2011 Josef Bacik <josef@redhat.com>

Btrfs: cleanup how we setup free space clusters

This patch makes the free space cluster refilling code a little easier to
understand, and fixes some things with the bitmap part of it. Currently we
either want to refill a cluster with

1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space

The current code has this ugly jump around logic that will first try and fill up
the cluster with extent entries and then if it can't do that it will try and
find a bitmap to use. So instead split this out into two functions, one that
tries to find only normal entries, and one that tries to find bitmaps.

This also fixes a suboptimal thing we would do with bitmaps. If we used a
bitmap we would just tell the cluster that we were pointing at a bitmap and it
would do the tree search in the block group for that entry every time we tried
to make an allocation. Instead of doing that now we just add it to the clusters
group.

I tested this with my ENOSPC tests and xfstests and it survived.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 22a94d44 16-Mar-2011 Josef Bacik <josef@redhat.com>

Btrfs: add checks to verify dir items are correct

We need to make sure the dir items we get are valid dir items. So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 66b4ffd1 31-Jan-2011 Josef Bacik <josef@redhat.com>

Btrfs: handle errors in btrfs_orphan_cleanup

If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
mount. It sucks, but hey it's better than hanging. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# a41ad394 31-Jan-2011 Josef Bacik <josef@redhat.com>

Btrfs: convert to the new truncate sequence

->truncate() is going away, instead all of the work needs to be done in
->setattr(). So this converts us over to do this. It's fairly straightforward,
just get rid of our .truncate inode operation and call btrfs_truncate() directly
from btrfs_setsize. This works out better for us since truncate can technically
return ENOSPC, and before we had no way of letting anybody know. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# dc89e982 28-Jan-2011 Josef Bacik <josef@redhat.com>

Btrfs: use a slab for the free space entries

Since we alloc/free free space entries a whole lot, lets use a slab to keep
track of them. This makes some of my tests slightly faster. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 36e39c40 12-Mar-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: break out of shrink_delalloc earlier

Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c87f08ca 16-Feb-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: allow balance to explicitly allocate chunks as it relocates

Btrfs device shrinking and balancing ends up reallocating all the blocks
in order to allow COW to move them to new destinations. It is somewhat
awkward in terms of ENOSPC because most of the enospc code is built
around the idea that some operation on a reference counted tree triggers
allocations in the non-reference counted trees.

This commit changes the balancing code to deal with enospc by trying to
allocate a new chunk. If that allocation succeeds, we go ahead and
retry whatever failed due to enospc.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 91435650 16-Feb-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: put ENOSPC debugging under a mount option

ENOSPC in btrfs is getting to the point where the extra debugging isn't
required. I've put it under mount -o enospc_debug just in case someone
is having difficult problems.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# acce952b 06-Jan-2011 liubo <liubo2009@cn.fujitsu.com>

Btrfs: forced readonly mounts on errors

This patch comes from "Forced readonly mounts on errors" ideas.

As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.

The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.

After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f8b18087 12-Jan-2011 Stefan Schmidt <stefan@datenfreihafen.org>

fs/btrfs: Fix build of ctree

Fix the build failure in some configurations:

CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2

caused by commit 57cc7215b708 ("headers: kobject.h redux")

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d07bcec 05-Jan-2011 Miao Xie <miaox@cn.fujitsu.com>

btrfs: fix wrong free space information of btrfs

When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.

# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10

It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.

This patch fixes it by calcing the free space that can be used to allocate
chunks.

Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1

This implementation is just likely fake chunk allocation.

After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f580eb09 12-Jan-2011 Stefan Schmidt <stefan@datenfreihafen.org>

fs/btrfs: Fix build of ctree

CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b74c79e9 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b83cc969 20-Dec-2010 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Add readonly snapshots support

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).

Implementation:

- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

Changelog for v3:

- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# a6fa6fae 25-Oct-2010 Li Zefan <lizf@cn.fujitsu.com>

btrfs: Add lzo compression support

Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications

Usage:

# mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
# mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt

"-o compress" without argument is still allowed for compatability.

Compatibility:

If we mount a filesystem with lzo compression, it will not be able be
mounted in old kernels. One reason is, otherwise btrfs will directly
dump compressed data, which sits in inline extent, to user.

Performance:

The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.

(time in second)
lzo zlib nocompress
copy: 10.6 21.7 14.9
extract: 70.1 94.4 66.6

(data size in MB)
lzo zlib nocompress
copy: 185.87 108.69 394.49
extract: 193.80 132.36 381.21

Changelog:

v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompability flag.
- Fix error handling in compress code.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 261507a0 16-Dec-2010 Li Zefan <lizf@cn.fujitsu.com>

btrfs: Allow to add new compression algorithm

Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 0410c94a 19-Nov-2010 Mariusz Kozlowski <mk@lab.zgora.pl>

btrfs: make 1-bit signed fileds unsigned

Fixes these sparse warnings:
fs/btrfs/ctree.h:811:17: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:812:20: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:813:19: error: dubious one-bit signed bitfield

Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4260f7c7 29-Oct-2010 Sage Weil <sage@newdream.net>

Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed

Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bb9c12c9 29-Oct-2010 Sage Weil <sage@newdream.net>

Btrfs: async transaction commit

Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 88c2ba3b 21-Sep-2010 Josef Bacik <josef@redhat.com>

Btrfs: Add a clear_cache mount option

If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 67377734 16-Sep-2010 Josef Bacik <josef@redhat.com>

Btrfs: add support for mixed data+metadata block groups

There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0cb59c99 01-Jul-2010 Josef Bacik <josef@redhat.com>

Btrfs: write out free space cache

This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0af3d00b 21-Jun-2010 Josef Bacik <josef@redhat.com>

Btrfs: create special free space cache inode

In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 8bb8ab2e 15-Oct-2010 Josef Bacik <josef@redhat.com>

Btrfs: rework how we reserve metadata bytes

With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 0019f10d 15-Oct-2010 Josef Bacik <josef@redhat.com>

Btrfs: re-work delalloc flushing

Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen. Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate. The
sync option will be used in a future patch.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 89a55897 14-Oct-2010 Josef Bacik <josef@redhat.com>

Btrfs: fix df regression

The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system. This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 45321ac5 07-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

Make ->drop_inode() just return whether inode needs to be dropped

... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bd555975 07-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

convert btrfs to ->evict_inode()

NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7ea80859 26-May-2010 Christoph Hellwig <hch@lst.de>

drop unused dentry argument to ->fsync

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4b46fce2 23-May-2010 Josef Bacik <josef@redhat.com>

Btrfs: add basic DIO read/write support

This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)

1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.

2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.

3) Use ordered extents for the writes so that all of the

lock_extent()
lookup_ordered()

type checks continue to work.

4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.

I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3fd0a558 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Metadata ENOSPC handling for balance

This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:

1. Avoid COW tree leave in the phrase of merging tree.

2. Handle interaction with snapshot creation.

3. make the backref cache can live across transactions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# efa56464 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Pre-allocate space for data relocation

Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d68fc57b 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Metadata reservation for orphan inodes

reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8929ecfa 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Introduce global metadata reservation

Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0ca1f7ce 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Update metadata reservation for delayed allocation

Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a22285a6 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Integrate metadata reservation with start_transaction

Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f0486c68 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Introduce contexts for metadata reservation

Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5da9d01b 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Shrink delay allocated space in a synchronized

Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 424499db 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Kill allocate_wait in space_info

We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b742bb82 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Link block groups of different raid types

The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 287a0ab9 19-Mar-2010 Josef Bacik <josef@redhat.com>

Btrfs: kill max_extent mount option

As Yan pointed out, theres not much reason for all this complicated math to
account for file extents being split up into max_extent chunks, since they are
likely to all end up in the same leaf anyway. Since there isn't much reason to
use max_extent, just remove the option altogether so we have one less thing we
need to test.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# 91748467 28-Feb-2010 Akinobu Mita <akinobu.mita@gmail.com>

btrfs: use memparse

Use memparse() instead of its own private implementation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2ac55d41 03-Feb-2010 Josef Bacik <josef@redhat.com>

Btrfs: cache the extent state everywhere we possibly can V2

This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function. This gives me about a 3 mb/s boots on my random write test. I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written. I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state. I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1e701a32 11-Mar-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: add new defrag-range ioctl.

The btrfs defrag ioctl was limited to doing the entire file. This
commit adds a new interface that can defrag a specific range inside
the file.

It can also force compression on the file, allowing you to selectively
compress individual files after they were created, even when mount -o
compress isn't turned on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6ef5ed0d 11-Dec-2009 Josef Bacik <josef@redhat.com>

Btrfs: add ioctl and incompat flag to set the default mount subvol

This patch needs to go along with my previous patch. This lets us set the
default dir item's location to whatever root we want to use as our default
mounting subvol. With this we don't have to use mount -o subvol=<tree id>
anymore to mount a different subvol, we can just set the new one and it will
just magically work. I've done some moderate testing with this, mostly just
switching the default mount around, mounting subvols and the default mount at
the same time and such, everything seems to work. Thanks,

Older kernels would generally be able to still mount the filesystem with the
default subvolume set, but it would result in a different volume being mounted,
which could be an even more unpleasant suprise for users. So if you set your
default subvolume, you can't go back to older kernels. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 73f73415 04-Dec-2009 Josef Bacik <josef@redhat.com>

Btrfs: change how we mount subvolumes

This work is in preperation for being able to set a different root as the
default mounting root.

There is currently a problem with how we mount subvolumes. We cannot currently
mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
default subvolume. So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have

/
/snap1
/snap1/snap2

as your available volumes. Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2. To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list). This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.

In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to. For now this just
points at what has always been the default subvolme, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>. I tested this out with the above scenario and it
worked perfectly. Thanks,

mount -o subvol operates inside the selected subvolid. For example:

mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt

/mnt will have the snap1 directory for the subvolume with id
256.

mount -o subvol=snap /dev/xxx /mnt

/mnt will be the snap directory of whatever the default subvolume
is.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 12534832 17-Dec-2009 Josef Bacik <josef@redhat.com>

Btrfs: make set/get functions for the super compat_ro flags use compat_ro

Our set/get functions for compat_ro_flags actually look at compat_flags. This
will mess any attempt to use compat flags up. The fix is obvious. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a9185b41 05-Mar-2010 Christoph Hellwig <hch@lst.de>

pass writeback_control to ->write_inode

This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a555f810 28-Jan-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount -o compress-force

The default btrfs mount -o compress mode will quickly back off
compressing a file if it notices that compression does not reduce the
size of the data being written. This can save considerable CPU because
all future writes to the file go through uncompressed.

But some files are both very large and have mixed data stored in
them. In that case, we want to add the ability to always try
compressing data before writing it.

This commit adds mount -o compress-force. A later commit will add
a new inode flag that does the same thing.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 86b9f2ec 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Fix per root used space accounting

The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 24bbcf04 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Add delayed iput

iput() can trigger new transactions if we are dropping the
final reference, so calling it in btrfs_commit_transaction
may end up deadlock. This patch adds delayed iput to avoid
the issue.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f34f57a3 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Pass transaction handle to security and ACL initialization functions

Pass transaction handle down to security and ACL initialization
functions, so we can avoid starting nested transactions

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c71bf099 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Avoid orphan inodes cleanup while replaying log

We do log replay in a single transaction, so it's not good to do unbound
operations. This patch cleans up orphan inodes cleanup after replaying
the log. It also avoids doing other unbound operations such as truncating
a file during replaying log. These unbound operations are postponed to
the orphan inode cleanup stage.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 920bbbfb 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Rewrite btrfs_drop_extents

Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
avoid calling lock_extent within transaction.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ad48fd75 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Add btrfs_duplicate_item

btrfs_duplicate_item duplicates item with new key, guaranteeing
the source item and the new items are in the same tree leaf and
contiguous. It allows us to split file extent in place, without
using lock_extent to prevent bookend extent race.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e244a0ae 14-Oct-2009 Christoph Hellwig <hch@lst.de>

Btrfs: add -o discard option

Enable discard by default is not a good idea given the the trim speed
of SSD prototypes we've seen, and the carecteristics for many high-end
arrays. Turn of discards by default and require the -o discard option
to enable them on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0eda294d 13-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix btrfs acl #ifdef checks

The btrfs acl code was #ifdefing for a define
that didn't exist. This correctly matches it
to the values used by the Kconfig file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 257c62e1 13-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: avoid tree log commit when there are no changes

rpm has a habit of running fdatasync when the file hasn't
changed. We already detect if a file hasn't been changed
in the current transaction but it might have been sent to
the tree-log in this transaction and not changed since
the last call to fsync.

In this case, we want to avoid a tree log sync, which includes
a number of synchronous writes and barriers. This commit
extends the existing tracking of the last transaction to change
a file to also track the last sub-transaction.

The end result is that rpm -ivh and -Uvh are roughly twice as fast,
and on par with ext3.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 82d339d9 09-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com>

Btrfs: constify dentry_operations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ff782e0a 08-Oct-2009 Josef Bacik <josef@redhat.com>

Btrfs: optimize fsync for the single writer case

This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie
for new people to join the logging transaction if there is only ever 1 writer.
This helps a little bit with latency where we have something like RPM where it
will fdatasync every file it writes, and so waiting the 1 jiffie for every
fdatasync really starts to add up.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e3ccfa98 07-Oct-2009 Josef Bacik <josef@redhat.com>

Btrfs: async delalloc flushing under space pressure

This patch moves the delalloc flushing that occurs when we are under space
pressure off to a async thread pool. This helps since we only free up
metadata space when we actually insert the extent item, which means it takes
quite a while for space to be free'ed up if we wait on all ordered extents.
However, if space is freed up due to inline extents being inserted, we can
wake people who are waiting up early, and they can finish their work.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 32c00aff 08-Oct-2009 Josef Bacik <josef@redhat.com>

Btrfs: release delalloc reservations on extent item insertion

This patch fixes an issue with the delalloc metadata space reservation
code. The problem is we used to free the reservation as soon as we
allocated the delalloc region. The problem with this is if we are not
inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out. This patch does 3 things,

1) It moves the reservation clearing stuff into the ordered code, so when
we remove the ordered extent we remove the reservation.
2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
delalloc bits in the cases where we want to clear the metadata reservation
when we clear the delalloc extent, in the case that we do an inline extent
or we invalidate the page.
3) It adds another waitqueue to the space info so that when we start a fs
wide delalloc flush, anybody else who also hits that area will simply wait
for the flush to finish and then try to make their allocation.

This has been tested thoroughly to make sure we did not regress on
performance.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 61d92c32 02-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix deadlock on async thread startup

The btrfs async worker threads are used for a wide variety of things,
including processing bio end_io functions. This means that when
the endio threads aren't running, the rest of the FS isn't
able to do the final processing required to clear PageWriteback.

The endio threads also try to exit as they become idle and
start more as the work piles up. The problem is that starting more
threads means kthreadd may need to allocate ram, and that allocation
may wait until the global number of writeback pages on the system is
below a certain limit.

The result of that throttling is that end IO threads wait on
kthreadd, who is waiting on IO to end, which will never happen.

This commit fixes the deadlock by handing off thread startup to a
dedicated thread. It also fixes a bug where the on-demand thread
creation was creating far too many threads because it didn't take into
account threads being started by other procs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 828c0950 01-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com>

const: constify remaining file_operations

[akpm@linux-foundation.org: fix KVM]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 3baf0bed 29-Sep-2009 Chris Ball <cjb@laptop.org>

Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code

We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're
currently not using it and are testing CONFIG_FS_POSIX_ACL instead.
CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs".

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9ed74f2d 11-Sep-2009 Josef Bacik <josef@redhat.com>

Btrfs: proper -ENOSPC handling

At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying. Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents. This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code. This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic. This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation. It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook. These help
us keep track of when we merge delalloc extents together and split them
up. Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have. Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs. This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1b2da372 11-Sep-2009 Josef Bacik <josef@redhat.com>

Btrfs: account for space used by the super mirrors

As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space info's. Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters. This patch introduces bytes_super, which keeps track of the amount
of bytes used for a super mirror in the block group cache and space info. This
makes sure that our free space caclucations will be completely accurate.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ba1bf481 11-Sep-2009 Josef Bacik <josef@redhat.com>

Btrfs: make balance code choose more wisely when relocating

Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.

V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.

-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.

-check to make sure the block group we are going to relocate isn't the last one
in that particular space

-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 76dda93c 21-Sep-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: add snapshot/subvolume destroy ioctl

This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4df27c4d 21-Sep-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: change how subvolumes are organized

btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.

The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.

This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 13a8a7c8 21-Sep-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: do not reuse objectid of deleted snapshot/subvol

The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.

Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 11833d66 11-Sep-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: improve async block group caching

This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents when block group is in
caching. To allocate logged file extents, the old code need wait
until block group is fully cached. To get rid of the limitations,
This patch introduces a data structure to track the progress of
caching. Base on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.

This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copy the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents. One tree for extents that are
pinned in the running transaction, one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a1ed835e 10-Sep-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix extent replacment race

Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones. There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 276e680d 30-Jul-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: preserve commit_root for async caching

The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.

During a commit, we have a loop where we update the extent allocation
tree root. We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.

Right now the commit_root pointer is changed inside this loop. It
is more correct to change the commit_root pointer only after all the
looping is done.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 68b38550 27-Jul-2009 Josef Bacik <josef@redhat.com>

Btrfs: change how we unpin extents

We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.

One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 817d52f8 13-Jul-2009 Josef Bacik <josef@redhat.com>

Btrfs: async block group caching

This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this

mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo

Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.

Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.

I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.

This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 96303081 13-Jul-2009 Josef Bacik <josef@redhat.com>

Btrfs: use hybrid extents+bitmap rb tree for free space

Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1bec1aed 22-Jul-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: fix definition of struct btrfs_extent_inline_ref

use __le64 instead of u64 in on-disk structure definition.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2c47e605 27-Jun-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: update backrefs while dropping snapshot

The new backref format has restriction on type of backref item. If a tree
block isn't referenced by its owner tree, full backrefs must be used for the
pointers in it. When a tree block loses its owner tree's reference, backrefs
for the pointers in it should be updated to full backrefs. Current
btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
general use.

This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a
problem in the restricted form btrfs_drop_snapshot is used today, but for
general snapshot deletion this update is required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5affd88a 08-Jun-2009 Al Viro <viro@zeniv.linux.org.uk>

switch btrfs to inode->i_acl

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7df336ec 10-Jun-2009 Al Viro <viro@ZenIV.linux.org.uk>

Fix btrfs when ACLs are configured out

... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6cbff00f 17-Apr-2009 Christoph Hellwig <hch@lst.de>

Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

Add support for the standard attributes set via chattr and read via
lsattr. Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c289811c 10-Jun-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: autodetect SSD devices

During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS. If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 451d7585 09-Jun-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: add mount -o ssd_spread to spread allocations out

Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in. It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5d4f98a2 10-Jun-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e980b50c 24-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: fix fallocate deadlock on inode extent lock

The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding. The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 97e728d4 21-Apr-2009 Josef Bacik <jbacik@redhat.com>

Btrfs: try to keep a healthy ratio of metadata vs data block groups

This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups. By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata. This
can be changed by specifying the metadata_ratio mount option.

This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation. By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d4a78947 02-Apr-2009 Wu Fengguang <fengguang.wu@intel.com>

Btrfs: fix typos in comments

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dccae999 02-Apr-2009 Sage Weil <sage@newdream.net>

Btrfs: add flushoncommit mount option

The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit. This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations). This was previously the behavior only when a snapshot is
created.

This is used by Ceph to ensure that completed writes make it to the
platter along with the metadata operations they are bound to (by
BTRFS_IOC_TRANS_{START,END}).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3a5e1404 02-Apr-2009 Sage Weil <sage@newdream.net>

Btrfs: notreelog mount option

Add a 'notreelog' mount option to disable the tree log (used by fsync,
O_SYNC writes). This is much slower, but the tree logging produces
inconsistent views into the FS for ceph.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fa9c0d79 03-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: rework allocation clustering

Because btrfs is copy-on-write, we end up picking new locations for
blocks very often. This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.

This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation. The end result is that we tend to write sequentially to
the drive.

This happens all the time for metadata and it happens for data
when mounted -o ssd. But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.

This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.

The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents. This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 04018de5 03-Apr-2009 Josef Bacik <jbacik@redhat.com>

Btrfs: kill the pinned_mutex

This patch removes the pinned_mutex. The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there. We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# 6226cb0a 03-Apr-2009 Josef Bacik <jbacik@redhat.com>

Btrfs: kill the block group alloc mutex

This patch removes the block group alloc mutex used to protect the free space
tree for allocations and replaces it with a spin lock which is used only to
protect the free space rb tree. This means we only take the lock when we are
directly manipulating the tree, which makes us a touch faster with
multi-threaded workloads.

This patch also gets rid of btrfs_find_free_space and replaces it with
btrfs_find_space_for_alloc, which takes the number of bytes you want to
allocate, and empty_size, which is used to indicate how much free space should
be at the end of the allocation.

It will return an offset for the allocator to use. If we don't end up using it
we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to
kill the alloc_mutex, since we need to make sure nobody else comes along and
takes our space.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# c2ec175c 31-Mar-2009 Nick Piggin <npiggin@suse.de>

mm: page_mkwrite change prototype to match fault

Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.

This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).

This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5a3f23d5 31-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: add extra flushing for renames and truncates

Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.

This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.

If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.

Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).

For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).

For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 12fcfd22 24-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: tree logging unlink/rename fixes

The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log. This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done. The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged. For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2). After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log. This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant. Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir. After a crash the rm -rf must
be replayed. This must be able to recurse down the entire
directory tree. The inode link count fixup code takes care of the
ugly details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b9473439 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: leave btree locks spinning more often

btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c3e69d58 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: process the delayed reference queue in clusters

The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree. These are processed by
finding records in the tree that are not currently being processed one at
a time.

This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.

This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 56bec294 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: do extent allocation and reference count updates in the background

The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9fa8cfe7 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: don't preallocate metadata blocks during btrfs_search_slot

In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.

A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4184ea7f 09-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix locking around adding new space_info

Storage allocated to different raid levels in btrfs is tracked by
a btrfs_space_info structure, and all of the current space_infos are
collected into a list_head.

Most filesystems have 3 or 4 of these structs total, and the list is
only changed when new raid levels are added or at unmount time.

This commit adds rcu locking on the list head, and properly frees
things at unmount time. It also clears the space_info->full flag
whenever new space is added to the FS.

The locking for the space info list goes like this:

reads: protected by rcu_read_lock()
writes: protected by the chunk_mutex

At unmount time we don't need special locking because all the readers
are gone.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6a63209f 20-Feb-2009 Josef Bacik <jbacik@redhat.com>

Btrfs: add better -ENOSPC handling

This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line.

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# 4008c04a 12-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: make a lockdep class for the extent buffer locks

Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e00f7308 12-Feb-2009 Jeff Mahoney <jeffm@suse.com>

Btrfs: remove btrfs_init_path

btrfs_init_path was initially used when the path objects were on the
stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b4ce94de 04-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Change btree locking to use explicit blocking points

Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c487685d 04-Feb-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: hash_lock is no longer needed

Before metadata is written to disk, it is updated to reflect that writeout
has begun. Once this update is done, the block must be cow'd before it
can be modified again.

This update was originally synchronized by using a per-fs spinlock. Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.

So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7237f183 20-Jan-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: fix tree logs parallel sync

To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.

This patch has following changes to fix the bug:

Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.

Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.

At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 95029d7d 21-Jan-2009 Jan Engelhardt <jengelh@medozas.de>

Btrfs: change/remove typedef

Change one typedef to a regular enum, and remove an unused one.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d397712b 05-Jan-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix checkpatch.pl warnings

There were many, most are fixed now. struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cad321ad 17-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: shift all end_io work to thread pools

bio_end_io for reads without checksumming on and btree writes were
happening without using async thread pools. This means the extent_io.c
code had to use spin_lock_irq and friends on the rb tree locks for
extent state.

There were some irq safe vs unsafe lock inversions between the delallock
lock and the extent state locks. This patch gets rid of them by moving
all end_io code into the thread pools.

To avoid contention and deadlocks between the data end_io processing and the
metadata end_io processing yet another thread pool is added to finish
off metadata writes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 17d217fe 12-Dec-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: fix nodatasum handling in balancing code

Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains following things to make data relocation
work.

1) make nodatasum/nodatacow mount option only affects new
files. Checksums and COW on data are only controlled by the
inode flags.

2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensure that
checksum for a given block is either valid or does not exist.

3) update data relocation code to properly handle the case
of checksum missing.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# d2fb3437 11-Dec-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: fix leaking block group on balance

The block group structs are referenced in many different
places, and it's not safe to free while balancing. So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 459931ec 10-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Delete csum items when freeing extents

This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent. Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock. This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c3027eb5 08-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add inode sequence number for NFS and reserved space in a few structs

This adds a sequence number to the btrfs inode that is increased on
every update. NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.

While we're here, this also:

Puts reserved space into the super block and inode

Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID. For now
the log root transid is always zero, but that'll get fixed.

Adds a starting offset to the dev_item. This will let us do better
alignment calculations if we know the start of a partition on the disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d20f7043 08-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: move data checksumming into a dedicated tree

Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.

* There is potentitally one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2a7108ad 02-Dec-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: rev the disk format for the inode compat and csum selection changes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 607d432d 02-Dec-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: add support for multiple csum algorithms

This patch gives us the space we will need in order to have different csum
algorithims at some point in the future. We save the csum algorithim type
in the superblock, and use those instead of define's.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# f2b636e8 02-Dec-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: add support for compat flags to btrfs

This adds the necessary disk format for handling compatibility flags
in the future to handle disk format changes. We have a compat_flags,
compat_ro_flags and incompat_flags set for the super block. Compat
flags will be to hold the features that are compatible with older
versions of btrfs, compat_ro flags have features that are compatible
with older versions of btrfs if the fs is mounted read only, and
incompat_flags has features that are incompatible with older versions
of btrfs. This also axes the compat_flags field for the inode and
just makes the flags field a 64bit field, and changes the root item
flags field to 64bit.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# ea6a478e 19-Nov-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex

This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# 73e9f5be 18-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Update the disk format for the seed device and new root code
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ea9e8b11 17-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: prevent loops in the directory tree when creating snapshots

For a directory tree:

/mnt/subvolA/subvolB

btrfsctl -s /mnt/subvolA/subvolB /mnt

Will create a directory loop with subvolA under subvolB. This
commit uses the forward refs for each subvol and snapshot to error out
before creating the loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0660b5af 17-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add backrefs and forward refs for subvols and snapshots

Subvols and snapshots can now be referenced from any point in the directory
tree. We need to maintain back refs for them so we can find lost
subvols.

Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol. This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3394e160 17-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Give each subvol and snapshot their own anonymous devid

Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.

This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.

btrfs_rename is changed to prevent renames across subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3de4586c 17-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Allow subvolumes and snapshots anywhere in the directory tree

Before, all snapshots and subvolumes lived in a single flat directory. This
was awkward and confusing because the single flat directory was only writable
with the ioctls.

This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree. This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.

The subvol ioctl does:

btrfsctl -S subvol_name parent_dir

After the ioctl is done subvol_name lives inside parent_dir.

The snapshot ioctl does:

btrfsctl -s path_for_snapshot root_to_snapshot

path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
into directory and basename components.

root_to_snapshot can be any file or directory in the FS. The snapshot
is taken of the entire root where that file lives.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2b82032c 17-Nov-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Seed device support

Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# c146afad 12-Nov-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: mount ro and remount support

This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# f3465ca4 12-Nov-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: batch extent inserts/updates/deletions on the extent root

While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions. This patch aims to try
and reduce that problem. This happens two ways

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion. This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf. On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert. In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update. And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk. There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# 771ed689 06-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Optimize compressed writeback and reads

When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time. Right now this is just compression. The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits. It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 537fb067 30-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: rev the disk format for fallocate

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d899e052 30-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Add fallocate support v2
This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it. This leverages the
-o nodatacow checks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 80ff3856 30-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: update nodatacow code v2

This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 9036c102 30-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: update hole handling v2

This patch splits the hole insertion code out of btrfs_setattr
into btrfs_cont_expand and updates btrfs_get_extent to properly
handle the case that file extent items are not continuous.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 09fde3c9 29-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Rev the disk format for compression and root pointer generation fields


# 84234f3a 29-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Add root tree pointer transaction ids

This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 25179201 29-Oct-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: nuke fs wide allocation mutex V2

This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# 80eb234a 29-Oct-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: fix enospc when there is plenty of space

So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.

If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.

The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.

This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.

Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>


# f82d02d9 29-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Improve space balancing code

This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# c8b97818 29-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add zlib compression support

This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cb8e7090 09-Oct-2008 Christoph Hellwig <hch@lst.de>

Btrfs: Fix subvolume creation locking rules

Creating a subvolume is in many ways like a normal VFS ->mkdir, and we
really need to play with the VFS topology locking rules. So instead of
just creating the snapshot on disk and then later getting rid of
confliting aliases do it correctly from the start. This will become
especially important once we allow for subvolumes anywhere in the tree,
and not just below a hidden root.

Note that snapshots will need the same treatment, but do to the delay
in creating them we can't do it currently. Chris promised to fix that
issue, so I'll wait on that.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 833023e4 09-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Rev the disk format for the new back reference format

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3bb1a1bc 09-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Remove offset field from struct btrfs_extent_ref

The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are in deleting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# a76a3cd4 09-Oct-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Count space allocated to file in bytes

This patch makes btrfs count space allocated to file in bytes instead
of 512 byte sectors.

Everything else in btrfs uses a byte count instead of sector sizes or
blocks sizes, so this fits better.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 30c43e24 02-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: remove last_log_alloc allocator optimization

The tree logging code was trying to separate tree log allocations
from normal metadata allocations to improve writeback patterns during
an fsync.

But, the code was not effective and ended up just mixing tree log
blocks with regular metadata. That seems to be working fairly well,
so the last_log_alloc code can be removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 323ac95b 01-Oct-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: don't read leaf blocks containing only checksums during truncate

Checksum items take up a significant portion of the metadata for large files.
It is possible to avoid reading them during truncates by checking the keys in
the higher level nodes.

If a given leaf is followed by another leaf where the lowest key is a checksum
item from the same file, we know we can safely delete the leaf without
reading it.

For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
the file with a cold cache. It is read bound during the run.

With this change, Btrfs is able to delete the file in 0.5s

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d352ac68 29-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: add and improve comments

This improves the comments at the top of many functions. It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8c8bee1d 29-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Wait for IO on the block device inodes of newly added devices

btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
The kernel code needs to wait for this IO to finish before it adds
the device.

btrfs metadata IO does not happen through the block device inode. A
separate address space is used, allowing the zero filled buffer heads in
the block device inode to be written to disk after FS metadata starts
going down to the disk via the btrfs metadata inode.

The end result is zero filled metadata blocks after adding new devices
into the filesystem.

The fix is a simple filemap_write_and_wait on the block device inode
before actually inserting it into the pool of available devices.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1a40e23b 26-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: update space balancing code

This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code can keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5b21f2ed 26-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: extent_map and data=ordered fixes for space balancing

* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated. This allows us to do accounting for them properly.

* The balancing code relocates data extents indepdent of the underlying
inode. The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl. It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules. Transaction start goes outside
the drop_mutex. This allows btrfs_commit_transaction to directly
drop the relocation trees.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e4657689 26-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: Add shared reference cache

Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots. To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.

This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, So
the old per-subvol reference cache is not well suited.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e8569813 26-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: allocator fixes for space balancing update

* Reserved extent accounting: reserved extents have been
allocated in the rbtrees that track free space but have not
been allocated on disk. They were never properly accounted for
in the past, making it hard to know how much space was really free.

* btrfs_find_block_group used to return NULL for block groups that
had been removed by the space balancing code. This made it hard
to account for space during the final stages of a balance run.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2b1f55b0 24-Sep-2008 Chris Mason <chris.mason@oracle.com>

Remove Btrfs compat code for older kernels

Btrfs had compatibility code for kernels back to 2.6.18. These have
been removed, and will be maintained in a separate backport
git tree from now on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 31840ae1 23-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: Full back reference support

This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0f9dd46c 23-Sep-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: free space accounting redo

1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.

4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d0c803c4 11-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Record dirty pages tree-log pages in an extent_io tree

This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk. There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6527cdbe 05-Sep-2008 Zheng Yan <zheng.yan@oracle.com>

Btrfs: Update find free objectid function for orphan cleanup code

Orphan items use BTRFS_ORPHAN_OBJECTID (-5UUL) as key objectid. This
affects the find free objectid functions, inode objectid can easily
overflow after orphan file cleanup.

---

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a237d2a2 05-Sep-2008 Christoph Hellwig <hch@lst.de>

remove unused function btrfs_ilookup

btrfs_ilookup is unused, which is good because a normal filesystem
should never have to use ilookup anyway. Remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 91c0827d 05-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Rev the disk format

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e02119d5 05-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a write ahead tree log to optimize synchronous operations

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f3f9931e 21-Aug-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Rev the disk format

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1a54ef8c 20-Jul-2008 Balaji Rao <balajirrao@gmail.com>

Introduce btrfs_iget helper

Date: Mon, 21 Jul 2008 02:01:04 +0530
This patch introduces a btrfs_iget helper to be used in NFS support.

Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4854ddd0 15-Aug-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Wait for kernel threads to make progress during async submission

Before this change, btrfs would use a bdi congestion function to make
sure there weren't too many pending async checksum work items.

This change makes the process creating async work items wait instead,
leading to fewer congestion returns from the bdi. This improves
pdflush background_writeout scanning.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0986fe9e 15-Aug-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Count async bios separately from async checksum work items

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5036f538 07-Aug-2008 Eric Sandeen <sandeen@redhat.com>

Btrfs: fix RHEL test for ClearPageFsMisc

Newer RHEL5 kernels define both ClearPageFSMisc and
ClearPageChecked, so test for both before redefining.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7ea394f1 05-Aug-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Fix nodatacow for the new data=ordered mode

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ea8c2819 04-Aug-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9ca9ee09 04-Aug-2008 Sage Weil <sage@newdream.net>

Btrfs: fix ioctl-initiated transactions vs wait_current_trans()

Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
progress) breaks the transaction start/stop ioctls by making
btrfs_start_transaction conditionally wait for the next transaction to
start. If an application artificially is holding a transaction open,
things deadlock.

This workaround maintains a count of open ioctl-initiated transactions in
fs_info, and avoids wait_current_trans() if any are currently open (in
start_transaction() and btrfs_throttle()). The start transaction ioctl
uses a new btrfs_start_ioctl_transaction() that _does_ call
wait_current_trans(), effectively pushing the join/wait decision to the
outer ioctl-initiated transaction.

This more or less neuters btrfs_throttle() when ioctl-initiated
transactions are in use, but that seems like a pretty fundamental
consequence of wrapping lots of write()'s in a transaction. Btrfs has no
way to tell if the application considers a given operation as part of it's
transaction.

Obviously, if the transaction start/stop ioctls aren't being used, there
is no effect on current behavior.

Signed-off-by: Sage Weil <sage@newdream.net>
---
ctree.h | 1 +
ioctl.c | 12 +++++++++++-
transaction.c | 18 +++++++++++++-----
transaction.h | 2 ++
4 files changed, 27 insertions(+), 6 deletions(-)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 65b51a00 01-Aug-2008 Chris Mason <chris.mason@oracle.com>

btrfs_search_slot: reduce lock contention by cowing in two stages

A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.

The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 61b49440 31-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix streaming read performance with checksumming on

Large streaming reads make for large bios, which means each entry on the
list async work queues represents a large amount of data. IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.

The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.

This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of a per-page. This results in many
extra btree lookups on large streaming reads. Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bcc63abb 30-Jul-2008 Yan <zheng.yan@oracle.com>

Btrfs: implement memory reclaim for leaf reference cache

The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f321e491 30-Jul-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Update and fix mount -o nodatacow

To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.

We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ab78c84d 29-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Throttle operations if the reference cache gets too large

A large reference cache is directly related to a lot of work pending
for the cleaner thread. This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 017e5369 28-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Leaf reference cache update

This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 31153d81 28-Jul-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Add a leaf reference cache

Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3a115f52 23-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Rev the disk format magic

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7b128766 23-Jul-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: Create orphan inode records to prevent lost files after a crash

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 33268eaf 23-Jul-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: Add ACL support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6099afe8 23-Jul-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: Remove unused xattr code

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# aec7477b 23-Jul-2008 Josef Bacik <jbacik@redhat.com>

Btrfs: Implement new dir index format

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3eaa2885 24-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix the defragmention code and the block relocation code for data=ordered

Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again. This is because the extents are not removed
until the IO for the new extents is finished

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4881ee5a 24-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix some build problems on 2.6.18 based enterprise kernels

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c286ac48 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: alloc_mutex latency reduction

This releases the alloc_mutex in a few places that hold it for over long
operations. btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6dddcbeb 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use mutex_lock_nested for tree locking

Lockdep has the notion of locking subclasses so that you can identify
locks you expect to be taken after other locks of the same class. This
changes the per-extent buffer btree locking routines to use a subclass based
on the level in the tree.

Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs
max level is also 8. So when lockdep is on, use a lower max level.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f421950f 22-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix some data=ordered related data corruptions

Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree. The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed. It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3edf7d33 18-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Handle data checksumming on bios that span multiple ordered extents

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f9295749 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

btrfs_start_transaction: wait for commits in progress to finish

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed. btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction. For them, btrfs_join_transaction
is added.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 247e743c 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Use async helpers to deal with pages that have been improperly dirtied

Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly setup for IO and
kicks off an async helper to deal with them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e6dcd2dc 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: New data=ordered implementation

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7d9eb12c 08-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add locking around volume management (device add/remove/balance)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3f157a2f 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Online btree defragmentation fixes

The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e7a84565 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add btree locking to the tree defragmentation code

The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a74a4b97 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Replace the transaction work queue with kthreads

This creates one kthread for commits and one kthread for
deleting old snapshots. All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5cd57b2c 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it

Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree. But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups. A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 051e1b9f 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Drop locks in btrfs_search_slot when reading a tree block.

One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a2135011 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Replace the big fs_mutex with a collection of other locks

Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 925baedd 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Start btree concurrency work.

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes. This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree. Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked. Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1cc127b5 12-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a thread pool just for submit_bio

If a bio submission is after a lock holder waiting for the bio
on the work queue, it is possible to deadlock. Move the bios
into their own pool.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f46b5a66 11-Jun-2008 Christoph Hellwig <hch@lst.de>

Btrfs: split out ioctl.c

Split the ioctl handling out of inode.c into a file of it's own.
Also fix up checkpatch.pl warnings for the moved code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4543df7e 11-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a mount option to control worker thread pool size

mount -o thread_pool_size changes the default, which is
min(num_cpus + 2, 8). Larger thread pools would make more sense on
very large disk arrays.

This mount option controls the max size of each thread pool. There
are multiple thread pools, so the total worker count will be larger
than the mount option.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8b712842 11-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add async worker threads for pre and post IO checksumming

Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system. But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up. The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines. Two worker pools are created,
one for writes and one for endio processing. The two could deadlock if
we tried to service both from a single pool.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# edf24abe 10-Jun-2008 Christoph Hellwig <hch@lst.de>

btrfs: sanity mount option parsing and early mount code

Also adds lots of comments to describe what's going on here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6bf13c0c 10-Jun-2008 Sage Weil <sage@newdream.net>

Btrfs: transaction ioctls

These ioctls let a user application hold a transaction open while it
performs a series of operations. A final ioctl does a sync on the fs
(closing the current transaction). This is the main requirement for
Ceph's OSD to be able to keep the data it's storing in a btrfs volume
consistent, and AFAICS it works just fine. The application would do
something like

fd = ::open("some/file", O_RDONLY);
::ioctl(fd, BTRFS_IOC_TRANS_START);
/* do a bunch of stuff */
::ioctl(fd, BTRFS_IOC_TRANS_END);
or just
::close(fd);

And to ensure it commits to disk,

::ioctl(fd, BTRFS_IOC_SYNC);

When a transaction is held open, the trans_handle is attached to the
struct file (via private_data) so that it will get cleaned up if the
process dies unexpectedly. A held transaction is also ended on fsync() to
avoid a deadlock.

A misbehaving application could also deliberately hold a transaction open,
effectively locking up the FS, so it may make sense to restrict something
like this to root or something.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3b96362c 09-Jun-2008 Sven Wegener <sven.wegener@stealer.net>

Btrfs: Invalidate dcache entry after creating snapshot and

We need to invalidate an existing dcache entry after creating a new
snapshot or subvolume, because a negative dache entry will stop us from
accessing the new snapshot or subvolume.

---
ctree.h | 23 +++++++++++++++++++++++
inode.c | 4 ++++
transaction.c | 4 ++++
3 files changed, 31 insertions(+)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0ef3e66b 24-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Allocator fix variety pack

* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
try to allocate from devices that don't exist

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cb03c743 15-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Change the congestion functions to meter the number of async submits as well

The async submit workqueue was absorbing too many requests, leading to long
stalls where the async submitters were stalling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dfe25020 13-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount -o degraded to allow mounts to continue with missing devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a68d5933 08-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Update nodatacow mode to support cloned single files and resizing

Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent. This check makes sure that multiple
inodes don't have references.

nodatacow needed an extra check to see if the block group was currently
readonly. This way cows forced by the chunk relocation code are honored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bf4ef679 08-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Properly find the root for snapshotted blocks during chunk relocation

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a061fc8d 07-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for online device removal

This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev. This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface. Things got ugly keeping the mapping constant.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f2eb0a24 02-May-2008 Sage Weil <sage@newdream.net>

Btrfs: Clone file data ioctl

Add a new ioctl to clone file data

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ec44a35c 28-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add balance ioctl to restripe the chunks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 788f20eb 28-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add new ioctl to add devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8f18cf13 25-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Make the resizer work based on shrinking and growing devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7ae9c09d 18-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for labels in the super block

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a443755f 18-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Check device uuids along with devids

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e015640f 16-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Write bio checksumming outside the FS mutex

This significantly improves streaming write performance by allowing
concurrency in the data checksumming.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 44b8bd7e 16-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Create a work queue for bio writes

This allows checksumming to happen in parallel among many cpus, and
keeps us from bogging down pdflush with the checksumming code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 321aecc6 16-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add RAID10 support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e17cade2 15-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add chunk uuids and update multi-device back references

Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 98d20f67 14-Apr-2008 Chris Mason <chris.mason@oracle.com>

Add a min size parameter to btrfs_alloc_extent

On huge machines, delayed allocation may try to allocate massive extents.
This change allows btrfs_alloc_extent to return something smaller than
the caller asked for, and the data allocation routines will loop over
the allocations until it fills the whole delayed alloc.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ce9adaa5 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Do metadata checksums for reads via a workqueue

Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read. It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d18a2c44 04-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix allocation profile init

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 611f0e00 03-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for duplicate blocks on a single spindle

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8790d502 03-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for mirroring across drives

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 63b10fc4 01-Apr-2008 Chris Mason <chris.mason@oracle.com>

Reorder the flags field in struct btrfs_header and record a flag on writeout

This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 04160088 26-Mar-2008 Chris Mason <chris.mason@oracle.com>

Create a btrfs backing dev info

This allows intelligent versions of unplug and congestion functions

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 593060d7 25-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Implement raid0 when multiple devices are present

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8a4b83cc 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for device scanning and detection ioctls

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 239b14b3 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Bring back mount -o ssd optimizations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0d81ba5d 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Move device information into the super block so it can be scanned

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e085def2 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Make the FS tree the last objectid in the tree of tree roots

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6324fbf3 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Dynamic chunk and block group allocation

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0b86a832 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for multiple devices per filesystem

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 065631f6 19-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: checksum file data at bio submission time instead of during writepage

When we checkum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.

This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.

Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# df68b8a7 15-Feb-2008 David Miller <davem@davemloft.net>

Btrfs: unaligned access fixes

Btrfs set/get macros lose type information needed to avoid
unaligned accesses on sparc64.
ere is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.

btrfs_name_hash is modified to return the hash value instead
of getting a return location via a (potentially unaligned)
pointer.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9069218d 08-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix i_blocks accounting

Now that delayed allocation accounting works, i_blocks accounting is changed
to only modify i_blocks when extents inserted or removed.

The fillattr call is changed to include the delayed allocation byte count
in the i_blocks result.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 47b0c4f8 04-Feb-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Update magic

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4529ba49 31-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add data block hints to SSD mode too

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6f568d35 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: mount -o max_inline=size to control the maximum inline extent size

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9c58309d 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add inode item and backref in one insert, reducing cpu usage

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 85e21bac 29-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: During deletes and truncate, remove many items at once from the tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d1310b2e 24-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Split the extent_map code into two parts

There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5f56406a 22-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix hole insertion corner cases

There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.

lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e18e4809 18-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount -o ssd, which includes optimizations for seek free storage

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2da98f00 16-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Run igrab on data=ordered inodes to prevent deadlocks during writeout

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cee36a03 15-Jan-2008 Chris Mason <chris.mason@oracle.com>

Rework btrfs_drop_inode to avoid scheduling

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 61295eb8 14-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add drop inode func to avoid data=ordered deadlock

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 69a32ac5 14-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Change magic string to reflect new format

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fdebe2bd 14-Jan-2008 Yan <yanzheng@21cn.com>

Btrfs: Add readonly inode flag

This patch adds readonly inode flag support. A file with this flag
can't be modified, but can be deleted.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 21ad10cf 09-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add flush barriers on commit

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b98b6767 08-Jan-2008 Yan <yanzheng@21cn.com>

Btrfs: Add inode flags support

This patch adds NODATASUM & NODATACOW inode flags support.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e2008b61 08-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletion

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dc17ff8f 08-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add data=ordered support

This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4313b399 03-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Reduce stack usage in the resizer, fix 32 bit compiles

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8f662a76 02-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# edbd8d4e 21-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Support for online FS resize (grow and shrink)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1832a6d5 21-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Implement basic support for -ENOSPC

This is intended to prevent accidentally filling the drive. A determined
user can still make things oops.

It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6da6abae 18-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Back port to 2.6.18-el kernels

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c59f8951 17-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount option to enforce a max extent size

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# be20aa9d 17-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount option to turn off data cow

A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.

In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b6cda9bc 14-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount -o nodatasum to turn of file data checksumming

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f6dbff55 13-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Reorder extent back refs to differentiate btree blocks from file data

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3954401f 12-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add back pointers from the inode to the directory that references it

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7bb86316 11-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add back pointers from extents to the btree or file referencing them

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 74493f7a 11-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Implement generation numbers in block pointers

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 87ee04eb 30-Nov-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add simple stripe size parameter

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 00f5c795 30-Nov-2007 Chris Mason <chris.mason@oracle.com>

btrfs_drop_extents: make sure the item is getting smaller before truncate

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 324ae4df 16-Nov-2007 Yan <yanzheng@21cn.com>

Btrfs: Add block group pinned accounting back

This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group', the last parameter of the function indicates
pin vs unpin.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5103e947 16-Nov-2007 Josef Bacik <jbacik@redhat.com>

xattr support for btrfs

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e644d021 06-Nov-2007 Chris Mason <chris.mason@oracle.com>

Fix recursive KM_USER1 usage in btrfs_realloc_node

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f84a8b36 06-Nov-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Optimize allocations as we need to mix data and metadata into one group

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 179e29e4 01-Nov-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix a number of inline extent problems that Yan Zheng reported.

The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place. It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f578d4bd 25-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Optimize csum insertion to create larger items when possible

This reduces the number of calls to btrfs_extend_item and greatly lowers
the cpu usage while writing large files.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6d7231f7 19-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix typo: owner is a 64 bit field

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a6b6e75e 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Defrag only leaves, and only when the parent node has a single objectid

This allows us to defrag huge directories, but skip the expensive defrag
case in more common usage, where it does not help as much.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 19c00ddc 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add back metadata checksumming

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0f82731f 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 810191ff 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: extent_map optimizations to cut down on CPU usage

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3326d1b0 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Allow tails larger than one page

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 14048ed0 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Cache extent buffer mappings

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# db94535d 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Allow tree blocks larger than the page size

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1a5bc167 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 96b5179d 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Stop using radix trees for the block group cache

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f510cfec 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix extent_buffer and extent_state leaks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6d36dcd4 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Avoid memcpy where possible in extent_buffers

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 479965d6 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Optimizations for the extent_buffer code

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5f39d397 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Create extent_buffer interface for large blocksizes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 34287aa3 14-Sep-2007 Christoph Hellwig <hch@lst.de>

Btrfs: use unlocked_ioctl

No reason to grab the BKL before calling into the btrfs ioctl code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5ce14bbc 11-Sep-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Find and remove dead roots the first time a root is loaded.

Dead roots are trees left over after a crash, and they were either in the
process of being removed or were waiting to be removed when the box crashed.
Before, a search of the entire tree of root pointers was done on mount
looking for dead roots. Now, the search is done the first time we load
a root.

This makes mount faster when there are a large number of snapshots, and it
enables the block accounting code to properly update the block counts on
the latest root as old versions of the root are reaped after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 011410bd 10-Sep-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add more synchronization before creating a snapshot

File data checksums are only done during writepage, so we have to make sure
all pages are written when the snapshot is taken. This also adds some
locking so that new writes don't race in and add new dirty pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 95e05289 29-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Use mount -o subvol to select the subvol directory instead of dev:

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 58176a96 29-Aug-2007 Josef Bacik <jbacik@redhat.com>

Btrfs: Add per-root block accounting and sysfs entries

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a52d9a80 27-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Extent based page cache code. This uses an rbtree of extents and tests
instead of buffer heads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 15ee9bc7 10-Aug-2007 Josef Bacik <jwhiter@redhat.com>

Btrfs: delay commits during fsync to allow more writers

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e9d0b13b 10-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Btree defrag on the extent-mapping tree as well

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 26b8003f 08-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Replace extent tree preallocation code with some bit radix magic.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f4468e94 08-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Let some locks go during defrag and snapshot dropping

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6702ed49 07-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add run time btree defrag, and an ioctl to force btree defrag

This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3c69faec 07-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fold some btree readahead routines into something more generic.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9f3a7427 07-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Do snapshot deletion in smaller chunks.

Before, snapshot deletion was a single atomic unit. This caused considerable
lock contention and required an unbounded amount of space. Now,
the drop_progress field in the root item is used to indicate how far along
snapshot deletion is, and to resume where it left off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ccd467d6 28-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: crash recovery fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4b52dff6 26-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix super block updates during transaction commit

The super block written during commit was not consistent with the state of
the trees. This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5eda7b5e 22-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add the ability to find and remove dead roots after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 54aa1f4d 22-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 11bd143f 22-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Switch to libcrc32c to avoid problems with cryptomgr on highmem machines

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9ebefb18 15-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: patch queue: page_mkwrite

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6cbd5570 12-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add GPLv2

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 39279cc3 12-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: split up super.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5276aeda 11-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: fix oops after block group lookup

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0cf6c620 09-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: remove device tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# facda1e7 08-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: get forced transaction commits via workqueue

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 08607c1b 08-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add compat ioctl

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fabb5681 07-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: d_type optimization

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1e2677e0 29-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: block group switching

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1de037a4 29-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: fixup various fsx failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3a686375 24-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: sparse files!

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e06afa83 23-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: rename

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f9f3c6b6 21-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: 2.6.21-git fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# de428b63 18-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: allocator optimizations, truncate readahead

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 509659cd 09-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: switch to crc32c instead of sha256

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e37c9e69 09-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: many allocator fixes, pretty solid

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3e1ad54f 07-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: allocator and tuning

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# be744175 06-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: more allocator enhancements

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# be08c1b9 03-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early metadata/data split

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 35b7e476 02-May-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: fix page cache memory leak

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 31f3c99b 30-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: allocator improvements, inode block groups

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cd1bc465 27-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: more block allocator work

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9078a3e1 26-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: start of block group code

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f2458e1d 25-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: change around extent-tree prealloc

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c62a1920 23-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: get rid of the extent_item type field

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4d775673 20-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add owner and type fields to the extents aand block headers

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e66f709b 20-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: write barriers on commit, balance level before split

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8fd17795 19-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early fsync support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7e38180e 19-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: directory inode index is back

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 236454df 19-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: many file_write fixes, inline data

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a429e513 18-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: working file_write, reorganized key flags

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 70b2befd 17-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: rework csums and extent item ordering

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b18c6685 17-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: progress on file_write

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6567e837 16-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early work to file_write in big extents

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b4100d64 11-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add a device id to device items

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7eccb903 11-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: create a logical->phsyical block number mapping scheme

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0bd93ba0 11-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early support for multiple devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cac87faa 11-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: use a dedicated inode num for root root dir

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d0dbc624 09-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: drop owner and parentid

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1b05da2e 09-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: drop the inode map tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c5739bba 10-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: snapshot progress

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0f7d52f4 09-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: groundwork for subvolume and snapshot roots

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d6e4a428 06-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: start of support for many FS volumes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5be6f7f1 05-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: dirindex optimizations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7fcde0e3 04-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: finish off inode indexing in dirs, add overflows

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5f26f772 05-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: more inode indexed directory work

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bae45de0 04-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add dir inode index

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b1a4d965 04-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: tweak the inode-map and free extent search starts on cold mount

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2c90e5d6 02-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: still corruption hunting

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d6025579 30-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: corruption hunt continues

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f254e52c 29-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: verify csums on read

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 87cbda5c 28-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: sha256 csums on metadata

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d98237b3 28-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: use a btree inode instead of sb_getblk

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9773a788 27-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: byte offsets for file keys

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 71951f35 27-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add generation field to file extent

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9a6f11ed 27-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: split out level field in struct header

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6407bf6d 27-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: reference counts on data extents

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dee26a9f 26-Mar-2007 Chris Mason <chris.mason@oracle.com>

btrfs_get_block, file read/write

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8ef97622 26-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add a radix back bit tree

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d561c025 23-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: very minimal locking

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7f5c1516 23-Mar-2007 Chris Mason <chris.mason@oracle.com>

Add generation number to btrfs_header, readdir fixes, hash collision fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d5719762 23-Mar-2007 Chris Mason <chris.mason@oracle.com>

btrfs_create, btrfs_write_super, btrfs_sync_fs

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 79154b1b 22-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: transaction rework

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e20d96d6 21-Mar-2007 Chris Mason <chris.mason@oracle.com>

Mountable btrfs, with readdir

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2e635a27 21-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: initial move to kernel module land

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1261ec42 20-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Better block record keeping, real mkfs

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 293ffd5f 20-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: change dir-test to insert inode_items

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9f5fae2f 20-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add inode map, and the start of file extent items

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e089f05c 16-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: transaction handles everywhere

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 88fd146c 16-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: pin freed blocks from the FS tree too

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a8a2ee0c 16-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add a name_len to dir items, reorder key

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1e1d2701 15-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add inode item

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1d4f6404 15-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: directory testing code and dir item fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 62e2749e 14-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Use a chunk of the key flags to record the item type.
Add (untested and simple) directory item code
Fix comp_keys to use the new key ordering
Add btrfs_insert_empty_item

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a1516c89 14-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: reorder key offset and flags

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 123abc88 14-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: variable block size support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 4beb1b8b 14-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add leaf data casting helper

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3768f368 13-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Change the super to point to a tree of trees to enable persistent snapshots

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 234b63a0 13-Mar-2007 Chris Mason <chris.mason@oracle.com>

rename funcs and structs to btrfs

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cf27e1ee 13-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: struct extent_item endian

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1d4f8a0c 13-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: node->blockptrs endian fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0783fcfc 12-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: struct item endian fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e2fa7227 12-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: struct key endian fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bb492bb0 11-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add sparse endian annotations to struct header
rename struct header to btrfs_header

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7518a238 11-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: get/set for struct header fields

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0579da42 07-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fixup last found extent caching

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a28ec197 06-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Fixup reference counting on cows

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 02217ed2 02-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early reference counting

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ed2ff2cb 01-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: pretend page cache & commit code

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# fec577fb 26-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add fsx-style randomized tree tester
Add debug-tree command to print the tree
Add extent-tree.c to the repo
Comment ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5de08d7d 24-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Break up ctree.c a little
Extent fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9a8dd150 23-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Block sized tree extents and extent deletion

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cfaa7295 21-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: extent fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d97e63b6 20-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early extent mapping support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# eb60ceac 02-Feb-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add backing store, memory management

Signed-off-by: Chris Mason <chris.mason@oracle.com>