History log of /linux-master/fs/btrfs/transaction.h
Revision Date Author Comments
# f33163ee 13-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: remove no longer used btrfs_transaction_in_commit()

The function btrfs_transaction_in_commit() is no longer used, its last
use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily slow
down delalloc if we're committing"), so just remove it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 602035d7 26-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: add forward declarations and headers, part 2

Do a cleanup in more headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>


# b321a52c 01-Dec-2023 Boris Burkov <boris@bur.io>

btrfs: free qgroup pertrans reserve on transaction abort

If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6008859b 04-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: add and use helpers for reading and writing log_transid

Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.

Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1723270f 22-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: move btrfs_defrag_root() to defrag.{c,h}

The btrfs_defrag_root() function does not really belong in the
transaction.{c,h} module and as we have a defrag.{c,h} nowadays,
move it to there instead. This also allows to stop exporting
btrfs_defrag_leaves(), so we can make it static.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename info to fs_info for consistency ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ed164802 08-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: rename errno identifiers to error

We sync the kernel files to userspace and the 'errno' symbol is defined
by standard library, which does not matter in kernel but the parameters
or local variables could clash. Rename them all.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28270e25 08-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: always reserve space for delayed refs when starting transaction

When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
the caller wants to be inserted/modified/deleted in a btree. This space
goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
than the amount of its reserved space, and the flush method is
BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
it corresponding to the number of items the caller wants to
insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
creates delayed refs after COWing an extent buffer, allocating a new
one or deleting (freeing) an extent buffer. This happens after the
the task started or joined a transaction, whenever it calls
btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
operations or when someone else calls btrfs_start_transaction() with
a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
transaction block reserve. When the task releases its transaction
handle (btrfs_end_transaction()) or it attempts to commit the
transaction, it releases any remaining space in the transaction block
reserve that it did not use, as not all space may have been used (due
to pessimistic space calculation) by calling btrfs_block_rsv_release()
which will try to add that unused space to the delayed refs block
reserve (if its current size is greater than its reserved space).
That transferred space may not be enough to completely fulfill the
delayed refs block reserve.

Plus we have some tasks that will attempt do modify as many leaves
as they can before getting -ENOSPC (and then reserving more space and
retrying), such as hole punching and extent cloning which call
btrfs_replace_file_extents(). Such tasks can generate therefore a
high number of delayed refs, for both metadata and data (we can't
know in advance how many file extent items we will find in a range
and therefore how many delayed refs for dropping references on data
extents we will generate);

6) If a transaction starts its commit before the delayed refs block
reserve is refilled, for example by the transaction kthread or by
someone who called btrfs_join_transaction() before starting the
commit, then when running delayed references if we don't have enough
reserved space in the delayed refs block reserve, we will consume
space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
transaction, since we have no guarantees the delayed refs block
reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
number of items greater than zero and at the time the delayed refs
reserve is full, then we don't reserve any space at
btrfs_start_transaction() for the delayed refs that will be generated
by a task, and we can therefore end up using a lot of space from the
global reserve when running the delayed refs during a transaction
commit;

4) There are also other operations that result in bumping the size of the
delayed refs reserve, such as creating and deleting block groups, as
well as the need to update a block group item because we allocated or
freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
and its reserved space, two very bad things may happen:

1) The reserved space of the global reserve may not be enough and we
fail the transaction commit with -ENOSPC when running delayed refs;

2) If the available space in the global reserve is enough it may result
in nearly exhausting it. If the fs has no more unallocated device
space for allocating a new block group and all the available space
in existing metadata block groups is not far from the global
reserve's size before we started the transaction commit, we may end
up in a situation where after the transaction commit we have too
little available metadata space, and any future transaction commit
will fail with -ENOSPC, because although we were able to reserve
space to start the transaction, we were not able to commit it, as
running delayed refs generates some more delayed refs (to update the
extent tree for example) - this includes not even being able to
commit a transaction that was started with the goal of unlinking a
file, removing an empty data block group or doing reclaim/balance,
so there's no way to release metadata space.

In the worst case the next time we mount the filesystem we may
also fail with -ENOSPC due to failure to commit a transaction to
cleanup orphan inodes. This later case was reported and hit by
someone running a SLE (SUSE Linux Enterprise) distribution for
example - where the fs had no more unallocated space that could be
used to allocate a new metadata block group, and the available
metadata space was about 1.5M, not enough to commit a transaction
to cleanup an orphan inode (or do relocation of data block groups
that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# adb86dbe 08-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: stop doing excessive space reservation for csum deletion

Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:

1) It uses btrfs_calc_insert_metadata_size() which is excessive because
we only need to delete csum items from the csum tree, we don't need
to insert any items, so btrfs_calc_metadata_size() is all we need (as
it computes space needed to delete an item);

2) If the free space tree is enabled, it doubles the amount of space,
which is pointless for csum deletion since we don't need to touch the
free space tree or any other tree other than the csum tree.

So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.

For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/test

umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT

# 100G gives at least 983040 extents with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar

# Flush all delalloc and clear all metadata from memory.
umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
rm -f $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "rm took $dur milliseconds"

umount $MNT

Before this change rm took: 7504 milliseconds
After this change rm took: 6574 milliseconds (-12.4%)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f8d1b011 26-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: always print transaction aborted messages with an error level

Commit b7af0635c87f ("btrfs: print transaction aborted messages with an
error level") changed the log level of transaction aborted messages from
a debug level to an error level, so that such messages are always visible
even on production systems where the log level is normally above the debug
level (and also on some syzbot reports).

Later, commit fccf0c842ed4 ("btrfs: move btrfs_abort_transaction to
transaction.c") changed the log level back to debug level when the error
number for a transaction abort should not have a stack trace printed.
This happened for absolutely no reason. It's always useful to print
transaction abort messages with an error level, regardless of whether
the error number should cause a stack trace or not.

So change back the log level to error level.

Fixes: fccf0c842ed4 ("btrfs: move btrfs_abort_transaction to transaction.c")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 77d20c68 24-Aug-2023 Josef Bacik <josef@toxicpanda.com>

btrfs: do not block starts waiting on previous transaction commit

Internally I got a report of very long stalls on normal operations like
creating a new file when auto relocation was running. The reporter used
the 'bpf offcputime' tracer to show that we would get stuck in
start_transaction for 5 to 30 seconds, and were always being woken up by
the transaction commit.

Using my timing-everything script, which times how long a function takes
and what percentage of that total time is taken up by its children, I
saw several traces like this

1083 took 32812902424 ns
29929002926 ns 91.2110% wait_for_commit_duration
25568 ns 7.7920e-05% commit_fs_roots_duration
1007751 ns 0.00307% commit_cowonly_roots_duration
446855602 ns 1.36182% btrfs_run_delayed_refs_duration
271980 ns 0.00082% btrfs_run_delayed_items_duration
2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
9656 ns 2.9427e-05% switch_commit_roots_duration
1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration

Here I was only tracing functions that happen where we are between
START_COMMIT and UNBLOCKED in order to see what would be keeping us
blocked for so long. The wait_for_commit() we do is where we wait for a
previous transaction that hasn't completed it's commit. This can
include all of the unpin work and other cleanups, which tends to be the
longest part of our transaction commit.

There is no reason we should be blocking new things from entering the
transaction at this point, it just adds to random latency spikes for no
reason.

Fix this by adding a PREP stage. This allows us to properly deal with
multiple committers coming in at the same time, we retain the behavior
that the winner waits on the previous transaction and the losers all
wait for this transaction commit to occur. Nothing else is blocked
during the PREP stage, and then once the wait is complete we switch to
COMMIT_START and all of the same behavior as before is maintained.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f880fe6e 08-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't hold an extra reference for redirtied buffers

When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit. But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).

Remove the extra reference and the infrastructure used to drop it
again.

History behind redirty logic:

In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>


# fccf0c84 07-Dec-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_abort_transaction to transaction.c

While trying to sync messages.[ch] I ended up with this dependency on
messages.h in the rest of btrfs-progs code base because it's where
btrfs_abort_transaction() was now held. We want to keep messages.[ch]
limited to the kernel code, and the btrfs_abort_transaction() code
better fits in the transaction code and not in messages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the __cold attributes ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 55e5cfd3 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: remove fs_info::pending_changes and related code

Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.

We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.

[1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# cc37ea61 09-Sep-2022 David Sterba <dsterba@suse.com>

btrfs: convert __TRANS_* defines to enum bits

The base transaction bits can be defined as bits in a contiguous
sequence, although right now there's a hole from bit 1 to 8.

The bits are used for btrfs_trans_handle::type, and there's another set
of TRANS_STATE_* defines that are for btrfs_transaction::state. They are
mutually exclusive though the hole in the sequence looks like was made
for the states.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 956504a3 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move trans_handle_cachep out of ctree.h

This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33c44184 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: pass btrfs_fs_info for deleting snapshots and cleaner

We're passing a root around here, but we only really need the fs_info,
so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead,
and then fix up all the callers appropriately.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b4be6aef 18-Feb-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: do not start relocation until in progress drops are done

We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item

WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>

Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.

Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.

Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then
balance can wait until this flag is cleared to start up again.

If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 28b21c55 21-Jan-2022 Filipe Manana <fdmanana@suse.com>

btrfs: fix use-after-free after failure to create a snapshot

At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
then attach it to the transaction's list of pending snapshots. After that
we call btrfs_commit_transaction(), and if that returns an error we jump
to 'fail' label, where we kfree() the pending snapshot structure. This can
result in a later use-after-free of the pending snapshot:

1) We allocated the pending snapshot and added it to the transaction's
list of pending snapshots;

2) We call btrfs_commit_transaction(), and it fails either at the first
call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
In both cases, we don't abort the transaction and we release our
transaction handle. We jump to the 'fail' label and free the pending
snapshot structure. We return with the pending snapshot still in the
transaction's list;

3) Another task commits the transaction. This time there's no error at
all, and then during the transaction commit it accesses a pointer
to the pending snapshot structure that the snapshot creation task
has already freed, resulting in a user-after-free.

This issue could actually be detected by smatch, which produced the
following warning:

fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list

So fix this by not having the snapshot creation ioctl directly add the
pending snapshot to the transaction's list. Instead add the pending
snapshot to the transaction handle, and then at btrfs_commit_transaction()
we add the snapshot to the list only when we can guarantee that any error
returned after that point will result in a transaction abort, in which
case the ioctl code can safely free the pending snapshot and no one can
access it anymore.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7a60751a 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: remove trans_handle->root

Nobody is using this anymore, remove it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fdfbf020 05-Nov-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: rework async transaction committing

Currently we do this awful thing where we get another ref on a trans
handle, async off that handle and commit the transaction from that work.
Because we do this we have to mess with current->journal_info and the
freeze counting stuff.

We already have an async thing to kick for the transaction commit, the
transaction kthread. Replace this work struct with a flag on the
fs_info to tell the kthread to go ahead and commit even if it's before
our timeout. Then we can drastically simplify the async transaction
commit path.

Note: this can be simplified and functionality based on the pending
operation COMMIT.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 79bd3712 29-Jun-2021 Filipe Manana <fdmanana@suse.com>

btrfs: rework chunk allocation to avoid exhaustion of the system chunk array

Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cb3db1c 29-Jun-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix deadlock with concurrent chunk allocations involving system chunks

When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but it did not finish yet the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").

However there are two very similar scenarios where this can lead to a
deadlock:

1) Task B allocated a new system chunk and task A is waiting on task B
to finish creation of the respective system block group. However before
task B ends its transaction handle and finishes the creation of the
system block group, it attempts to allocate another chunk (like a data
chunk for an fallocate operation for a very large range). Task B will
be unable to progress and allocate the new chunk, because task A set
space_info->chunk_alloc to 1 and therefore it loops at
btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
and set space_info->chunk_alloc to 0, but task A is waiting on task B
to finish creation of the new system block group, therefore resulting
in a deadlock;

2) Task B allocated a new system chunk and task A is waiting on task B to
finish creation of the respective system block group. By the time that
task B enter the final phase of block group allocation, which happens
at btrfs_create_pending_block_groups(), when it modifies the extent
tree, the device tree or the chunk tree to insert the items for some
new block group, it needs to allocate a new chunk, so it ends up at
btrfs_chunk_alloc() and keeps looping there because task A has set
space_info->chunk_alloc to 1, but task A is waiting for task B to
finish creation of the new system block group and release the reserved
system space, therefore resulting in a deadlock.

In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.

Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 35b22c19 21-Jun-2021 Filipe Manana <fdmanana@suse.com>

btrfs: send: fix crash when memory allocations trigger reclaim

When doing a send we don't expect the task to ever start a transaction
after the initial check that verifies if commit roots match the regular
roots. This is because after that we set current->journal_info with a
stub (special value) that signals we are in send context, so that we take
a read lock on an extent buffer when reading it from disk and verifying
it is valid (its generation matches the generation stored in the parent).
This stub was introduced in 2014 by commit a26e8c9f75b0bf ("Btrfs: don't
clear uptodate if the eb is under IO") in order to fix a concurrency issue
between send and balance.

However there is one particular exception where we end up needing to start
a transaction and when this happens it results in a crash with a stack
trace like the following:

[60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
[60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
[60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
[60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
[60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
[60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
[60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
[60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
[60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
[60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
[60015.902408] kernel: FS: 00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
[60015.902410] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
[60015.902414] kernel: Call Trace:
[60015.902419] kernel: kfence_unprotect+0x13/0x30
[60015.902423] kernel: page_fault_oops+0x89/0x270
[60015.902427] kernel: ? search_module_extables+0xf/0x40
[60015.902431] kernel: ? search_bpf_extables+0x57/0x70
[60015.902435] kernel: kernelmode_fixup_or_oops+0xd6/0xf0
[60015.902437] kernel: __bad_area_nosemaphore+0x142/0x180
[60015.902440] kernel: exc_page_fault+0x67/0x150
[60015.902445] kernel: asm_exc_page_fault+0x1e/0x30
[60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
[60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
[60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
[60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
[60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
[60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
[60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
[60015.902468] kernel: btrfs_commit_inode_delayed_inode+0x5d/0x120
[60015.902473] kernel: btrfs_evict_inode+0x2c5/0x3f0
[60015.902476] kernel: evict+0xd1/0x180
[60015.902480] kernel: inode_lru_isolate+0xe7/0x180
[60015.902483] kernel: __list_lru_walk_one+0x77/0x150
[60015.902487] kernel: ? iput+0x1a0/0x1a0
[60015.902489] kernel: ? iput+0x1a0/0x1a0
[60015.902491] kernel: list_lru_walk_one+0x47/0x70
[60015.902495] kernel: prune_icache_sb+0x39/0x50
[60015.902497] kernel: super_cache_scan+0x161/0x1f0
[60015.902501] kernel: do_shrink_slab+0x142/0x240
[60015.902505] kernel: shrink_slab+0x164/0x280
[60015.902509] kernel: shrink_node+0x2c8/0x6e0
[60015.902512] kernel: do_try_to_free_pages+0xcb/0x4b0
[60015.902514] kernel: try_to_free_pages+0xda/0x190
[60015.902516] kernel: __alloc_pages_slowpath.constprop.0+0x373/0xcc0
[60015.902521] kernel: ? __memcg_kmem_charge_page+0xc2/0x1e0
[60015.902525] kernel: __alloc_pages_nodemask+0x30a/0x340
[60015.902528] kernel: pipe_write+0x30b/0x5c0
[60015.902531] kernel: ? set_next_entity+0xad/0x1e0
[60015.902534] kernel: ? switch_mm_irqs_off+0x58/0x440
[60015.902538] kernel: __kernel_write+0x13a/0x2b0
[60015.902541] kernel: kernel_write+0x73/0x150
[60015.902543] kernel: send_cmd+0x7b/0xd0
[60015.902545] kernel: send_extent_data+0x5a3/0x6b0
[60015.902549] kernel: process_extent+0x19b/0xed0
[60015.902551] kernel: btrfs_ioctl_send+0x1434/0x17e0
[60015.902554] kernel: ? _btrfs_ioctl_send+0xe1/0x100
[60015.902557] kernel: _btrfs_ioctl_send+0xbf/0x100
[60015.902559] kernel: ? enqueue_entity+0x18c/0x7b0
[60015.902562] kernel: btrfs_ioctl+0x185f/0x2f80
[60015.902564] kernel: ? psi_task_change+0x84/0xc0
[60015.902569] kernel: ? _flat_send_IPI_mask+0x21/0x40
[60015.902572] kernel: ? check_preempt_curr+0x2f/0x70
[60015.902576] kernel: ? selinux_file_ioctl+0x137/0x1e0
[60015.902579] kernel: ? expand_files+0x1cb/0x1d0
[60015.902582] kernel: ? __x64_sys_ioctl+0x82/0xb0
[60015.902585] kernel: __x64_sys_ioctl+0x82/0xb0
[60015.902588] kernel: do_syscall_64+0x33/0x40
[60015.902591] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
[60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
[60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
[60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
[60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
[60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
[60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
[60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---

This happens because when writing to the pipe, by calling kernel_write(),
we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
gfp flags, which allow reclaim to happen if there is memory pressure. This
allocation happens at fs/pipe.c:pipe_write().

If the reclaim is triggered, inode eviction can be triggered and that in
turn can result in starting a transaction if the inode has a link count
of 0. The transaction start happens early on during eviction, when we call
btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
there is currently an open file descriptor for an inode with a link count
of 0 and the reclaim task gets a reference on the inode before that
descriptor is closed, in which case the reclaim task ends up doing the
final iput that triggers the inode eviction.

When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
the following assertion at transaction.c:start_transaction():

/* Send isn't supposed to start transactions. */
ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);

And when assertions are not enabled, it triggers a crash since after that
assertion we cast current->journal_info into a transaction handle pointer
and then dereference it:

if (current->journal_info) {
WARN_ON(type & TRANS_EXTWRITERS);
h = current->journal_info;
refcount_inc(&h->use_count);
(...)

Which obviously results in a crash due to an invalid memory access.

The same type of issue can happen during other memory allocations we
do directly in the send code with kmalloc (and friends) as they use
GFP_KERNEL and therefore may trigger reclaim too, which started to
happen since 2016 after commit e780b0d1c1523e ("btrfs: send: use
GFP_KERNEL everywhere").

The issue could be solved by setting up a NOFS context for the entire
send operation so that reclaim could not be triggered when allocating
memory or pages through kernel_write(). However that is not very friendly
and we can in fact get rid of the send stub because:

1) The stub was introduced way back in 2014 by commit a26e8c9f75b0bf
("Btrfs: don't clear uptodate if the eb is under IO") to solve an
issue exclusive to when send and balance are running in parallel,
however there were other problems between balance and send and we do
not allow anymore to have balance and send run concurrently since
commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due
to concurrent relocation"). More generically the issues are between
send and relocation, and that last commit eliminated only the
possibility of having send and balance run concurrently, but shrinking
a device also can trigger relocation, and on zoned filesystems we have
relocation of partially used block groups triggered automatically as
well. The previous patch that has a subject of:

"btrfs: ensure relocation never runs while we have send operations running"

Addresses all the remaining cases that can trigger relocation.

2) We can actually allow starting and even committing transactions while
in a send context if needed because send is not holding any locks that
would block the start or the commit of a transaction.

So get rid of all the logic added by commit a26e8c9f75b0bf ("Btrfs: don't
clear uptodate if the eb is under IO"). We can now always call
clear_extent_buffer_uptodate() at verify_parent_transid() since send is
the only case that uses commit roots without having a transaction open or
without holding the commit_root_sem.

Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32cc4f87 03-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: sink wait_for_unblock parameter to async commit

There's only one caller left btrfs_ioctl_start_sync that passes 0, so we
can remove the switch in btrfs_commit_transaction_async.

A cleanup 9babda9f33fd ("btrfs: Remove async_transid from
btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
1, so this is a followup.

As this removes last call of wait_current_trans_commit_start_and_unblock,
remove the function as well.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5963ffca 20-May-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: always abort the transaction if we abort a trans handle

While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.

Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.

This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data. This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.

This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.

Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eafa4fd0 31-Mar-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix exhaustion of the system chunk array due to concurrent allocations

When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:

[1359.518899] ------------[ cut here ]------------
[1359.518980] BTRFS: Transaction aborted (error -27)
[1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[1359.519152] Modules linked in: (...)
[1359.519239] Supported: Yes, External
[1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3
[1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
[1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default)
[1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007
[1359.519356] CFAR: c00000000013e170 IRQMASK: 0
[1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
[1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
[1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
[1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
[1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
[1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
[1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
[1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
[1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
[1359.519626] Call Trace:
[1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
[1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
[1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
[1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
[1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
[1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
[1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
[1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
[1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
[1359.520037] Instruction dump:
[1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
[1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
[1359.520122] ---[ end trace d6c186e151022e20 ]---

The following steps explain how we can end up in this situation:

1) Task A is at check_system_chunk(), either because it is allocating a
new data or metadata block group, at btrfs_chunk_alloc(), or because
it is removing a block group or turning a block group RO. It does not
matter why;

2) Task A sees that there is not enough free space in the system
space_info object, that is 'left' is < 'thresh'. And at this point
the system space_info has a value of 0 for its 'bytes_may_use'
counter;

3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
a new system block group (chunk) and then reserves 'thresh' bytes in
the chunk block reserve with the call to btrfs_block_rsv_add(). This
changes the chunk block reserve's 'reserved' and 'size' counters by an
amount of 'thresh', and changes the 'bytes_may_use' counter of the
system space_info object from 0 to 'thresh'.

Also during its call to btrfs_alloc_chunk(), we end up increasing the
value of the 'total_bytes' counter of the system space_info object by
8MiB (the size of a system chunk stripe). This happens through the
call chain:

btrfs_alloc_chunk()
create_chunk()
btrfs_make_block_group()
btrfs_update_space_info()

4) After it finishes the first phase of the block group allocation, at
btrfs_chunk_alloc(), task A unlocks the chunk mutex;

5) At this point the new system block group was added to the transaction
handle's list of new block groups, but its block group item, device
items and chunk item were not yet inserted in the extent, device and
chunk trees, respectively. That only happens later when we call
btrfs_finish_chunk_alloc() through a call to
btrfs_create_pending_block_groups();

Note that only when we update the chunk tree, through the call to
btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
of the chunk block reserve as we COW/allocate extent buffers,
through:

btrfs_alloc_tree_block()
btrfs_use_block_rsv()
btrfs_block_rsv_use_bytes()

And the system space_info's 'bytes_may_use' is decremented everytime
we allocate an extent buffer for COW operations on the chunk tree,
through:

btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
btrfs_add_reserved_bytes()

If we end up COWing less chunk btree nodes/leaves than expected, which
is the typical case since the amount of space we reserve is always
pessimistic to account for the worst possible case, we release the
unused space through:

btrfs_create_pending_block_groups()
btrfs_trans_release_chunk_metadata()
btrfs_block_rsv_release()
block_rsv_release_bytes()
btrfs_space_info_free_bytes_may_use()

But before task A gets into btrfs_create_pending_block_groups()...

6) Many other tasks start allocating new block groups through fallocate,
each one does the first phase of block group allocation in a
serialized way, since btrfs_chunk_alloc() takes the chunk mutex
before calling check_system_chunk() and btrfs_alloc_chunk().

However before everyone enters the final phase of the block group
allocation, that is, before calling btrfs_create_pending_block_groups(),
new tasks keep coming to allocate new block groups and while at
check_system_chunk(), the system space_info's 'bytes_may_use' keeps
increasing each time a task reserves space in the chunk block reserve.
This means that eventually some other task can end up not seeing enough
free space in the system space_info and decide to allocate yet another
system chunk.

This may repeat several times if yet more new tasks keep allocating
new block groups before task A, and all the other tasks, finish the
creation of the pending block groups, which is when reserved space
in excess is released. Eventually this can result in exhaustion of
system chunk array in the superblock, with btrfs_add_system_chunk()
returning EFBIG, resulting later in a transaction abort.

Even when we don't reach the extreme case of exhausting the system
array, most, if not all, unnecessarily created system block groups
end up being unused since when finishing creation of the first
pending system block group, the creation of the following ones end
up not needing to COW nodes/leaves of the chunk tree, so we never
allocate and deallocate from them, resulting in them never being
added to the list of unused block groups - as a consequence they
don't get deleted by the cleaner kthread - the only exceptions are
if we unmount and mount the filesystem again, which adds any unused
block groups to the list of unused block groups, if a scrub is
run, which also adds unused block groups to the unused list, and
under some circumstances when using a zoned filesystem or async
discard, which may also add unused block groups to the unused list.

So fix this by:

*) Tracking the number of reserved bytes for the chunk tree per
transaction, which is the sum of reserved chunk bytes by each
transaction handle currently being used;

*) When there is not enough free space in the system space_info,
if there are other transaction handles which reserved chunk space,
wait for some of them to complete in order to have enough excess
reserved space released, and then try again. Otherwise proceed with
the creation of a new system chunk.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc0939fc 22-Feb-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between marking inode needs to be logged and log syncing

We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

inode->logged_trans set to N;

3) The inode's root currently has:

root->log_transid set to 1
root->last_log_commit set to 0

Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.

Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:

spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);

So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1

It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;

9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:

inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);

Which results in inode->last_log_commit being set to 1.
The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:

inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty

And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.

It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.

However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:

vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)

So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.

This is an extremely unlikely race to happen in practice.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d3575156 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: redirty released extent buffers

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d0c2f4fa 27-Jan-2021 Filipe Manana <fdmanana@suse.com>

btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:

$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 300 64

umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9627107 0.153 61.938
Close 7072076 0.001 3.175
Rename 407633 1.222 44.439
Unlink 1943895 0.658 44.440
Deltree 256 17.339 110.891
Mkdir 128 0.003 0.009
Qpathinfo 8725406 0.064 17.850
Qfileinfo 1529516 0.001 2.188
Qfsinfo 1599884 0.002 1.457
Sfileinfo 784200 0.005 3.562
Find 3373513 0.411 30.312
WriteX 4802132 0.053 29.054
ReadX 15089959 0.002 5.801
LockX 31344 0.002 0.425
UnlockX 31344 0.001 0.173
Flush 674724 5.952 341.830

Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms

After applying patchset, 32 clients:

After patchset, with 32 clients:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9931568 0.111 25.597
Close 7295730 0.001 2.171
Rename 420549 0.982 49.714
Unlink 2005366 0.497 39.015
Deltree 256 11.149 89.242
Mkdir 128 0.002 0.014
Qpathinfo 9001863 0.049 20.761
Qfileinfo 1577730 0.001 2.546
Qfsinfo 1650508 0.002 3.531
Sfileinfo 809031 0.005 5.846
Find 3480259 0.309 23.977
WriteX 4952505 0.043 41.283
ReadX 15568127 0.002 5.476
LockX 32338 0.002 0.978
UnlockX 32338 0.001 2.032
Flush 696017 7.485 228.835

Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms

--> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 8956748 0.342 108.312
Close 6579660 0.001 3.823
Rename 379209 2.396 81.897
Unlink 1808625 1.108 131.148
Deltree 256 25.632 172.176
Mkdir 128 0.003 0.018
Qpathinfo 8117615 0.131 55.916
Qfileinfo 1423495 0.001 2.635
Qfsinfo 1488496 0.002 5.412
Sfileinfo 729472 0.007 8.643
Find 3138598 0.855 78.321
WriteX 4470783 0.102 79.442
ReadX 14038139 0.002 7.578
LockX 29158 0.002 0.844
UnlockX 29158 0.001 0.567
Flush 627746 14.168 506.151

Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms

After applying patchset, 64 clients:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9069003 0.303 43.193
Close 6662328 0.001 3.888
Rename 383976 2.194 46.418
Unlink 1831080 1.022 43.873
Deltree 256 24.037 155.763
Mkdir 128 0.002 0.005
Qpathinfo 8219173 0.137 30.233
Qfileinfo 1441203 0.001 3.204
Qfsinfo 1507092 0.002 4.055
Sfileinfo 738775 0.006 5.431
Find 3177874 0.936 38.170
WriteX 4526152 0.084 39.518
ReadX 14213562 0.002 24.760
LockX 29522 0.002 1.221
UnlockX 29522 0.001 0.694
Flush 635652 14.358 422.039

Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms

--> +6.8% throughput, -18.1% max latency

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a2633b6a 24-Nov-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: return bool from btrfs_should_end_transaction

Results in slightly smaller code.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11)
Function old new delta
btrfs_should_end_transaction 96 85 -11
Total: Before=20070, After=20059, chg -0.05%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ecfdc08b 24-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: remove dio iomap DSYNC workaround

This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0eb79294 03-Sep-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: dio iomap DSYNC workaround

iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
This is problematic for us because of 2 reasons:

1. we hold the inode_lock() during this operation, and we take it in
generic_write_sync()
2. we hold a read lock on the dio_sem but take the write lock in fsync

Since we don't want to rip out this code right now, but reworking the
locking is a bit much to do at this point, work around this problem with
this masterpiece of a patch.

First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
that it needs to handle the sync. We save this fact in
current->journal_info, because we need to see do special things once
we're in iomap_begin, and we have no way to pass private information
into iomap_dio_rw().

Next we specify a separate iomap_dio_ops for sync, which implements an
->end_io() callback that gets called when the dio completes. This is
important for AIO, because we really do need to run generic_write_sync()
if we complete asynchronously. However if we're still in the submitting
context when we enter ->end_io() we clear the flag so that the submitter
knows they're the ones that needs to run generic_write_sync().

This is meant to be temporary. We need to work out how to eliminate the
inode_lock() and the dio_sem in our fsync and use another mechanism to
protect these operations.

Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 48778179 10-Aug-2020 Filipe Manana <fdmanana@suse.com>

btrfs: make fast fsyncs wait only for writeback

Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.

Until commit b5e6c3e170b770 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:

https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/

This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.

When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.

This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.

Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).

The following script that calls fio was used:

$ cat test-fsync.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-d single -m single"

if [ $# -ne 5 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
exit 1
fi

NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
WRITE_MODE=$5

if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
exit 1
fi

cat <<EOF > /tmp/fio-job.ini
[writers]
rw=$WRITE_MODE
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo

umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT

The results were the following:

*************************
*** sequential writes ***
*************************

==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec

After patch:

WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)

==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec

After patch:

WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)

==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec

After patch:

WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec

After patch:

WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)

==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec

After patch:

WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)

==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec

After patch:

WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)

==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec

After patch:

WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)

************************
*** random writes ***
************************

==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec

After patch:

WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)

==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec

After patch:

WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)

==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec

After patch:

WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec

After patch:

WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)

==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec

After patch:

WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)

==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec

After patch:

WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)

==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec

After patch:

WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# adca4d94 13-Jul-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT

commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the qgroup free against threshold and tries to wake up commit
kthread to free some space.

The problem of that mechanism is, it can only free qgroup per-trans
metadata space, can't do anything to data, nor prealloc qgroup space.

Now since we have the ability to flush qgroup space, and implemented
retry-after-EDQUOT behavior, such mechanism can be completely replaced.

So this patch will cleanup such mechanism in favor of
retry-after-EDQUOT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2dfb1e43 15-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: preallocate anon block device at first phase of snapshot creation

[BUG]
When the anonymous block device pool is exhausted, subvolume/snapshot
creation fails with EMFILE (Too many files open). This has been reported
by a user. The allocation happens in the second phase during transaction
commit where it's only way out is to abort the transaction

BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

[CAUSE]
When the global anonymous block device pool is exhausted, the following
call chain will fail, and lead to transaction abort:

btrfs_ioctl_snap_create_v2()
|- btrfs_ioctl_snap_create_transid()
|- btrfs_mksubvol()
|- btrfs_commit_transaction()
|- create_pending_snapshot()
|- btrfs_get_fs_root()
|- btrfs_init_fs_root()
|- get_anon_bdev()

[FIX]
Although we can't enlarge the anonymous block device pool, at least we
can preallocate anon_dev for subvolume/snapshot in the first phase,
outside of transaction context and exactly at the moment the user calls
the creation ioctl.

Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d9094414 05-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_set_inode_last_trans take btrfs_inode

Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
an input paramter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7f9fe614 13-Mar-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: improve global reserve stealing logic

For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start an
ordinary transaction and if it fails it will fall back to reserving the
required amount by stealing from the global reserve. This is problematic
because of all the same reasons we had with previous iterations of the
ENOSPC handling, thundering herd. We get a bunch of failures all at
once, everybody tries to allocate from the global reserve, some win and
some lose, we get an ENSOPC.

Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure. We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve. If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fe119a6e 20-Jan-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: switch to per-transaction pinned extents

This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.

Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bf31f87f 05-Feb-2020 David Sterba <dsterba@suse.com>

btrfs: add wrapper for transaction abort predicate

The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8d510121 08-Oct-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Rename btrfs_join_transaction_nolock

This function is used only during the final phase of freespace cache
writeout. This is necessary since using the plain btrfs_join_transaction
api is deadlock prone. The deadlock looks like:

T1:
btrfs_commit_transaction
commit_cowonly_roots
btrfs_write_dirty_block_groups
btrfs_wait_cache_io
__btrfs_wait_cache_io
btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
inode and blocks transaction commit
until freespace cache writeout

T2: <-- after T1 has triggered the writeout
finish_ordered_fn
btrfs_finish_ordered_io
btrfs_join_transaction <--- this would block waiting for current
transaction to commit, but since trans
commit is waiting for this writeout to
finish

The special purpose functions prevents it by simply skipping the "wait
for writeout" since it's guaranteed the transaction won't proceed until
we are done.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3296bf56 22-Aug-2019 Qu Wenruo <wqu@suse.com>

btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED

The state was introduced in commit 4a9d8bdee368 ("Btrfs: make the state
of the transaction more readable"), then in commit 302167c50b32
("btrfs: don't end the transaction for delayed refs in throttle") the
state is completely removed.

So we can just clean up the state since it's only compared but never
set.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b9fae2eb 11-Sep-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: make btrfs_wait_extents() static

It's not used ouside of transaction.c

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a6d155d2 29-Jul-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: fix deadlock between fiemap and transaction commits

The fiemap handler locks a file range that can have unflushed delalloc,
and after locking the range, it tries to attach to a running transaction.
If the running transaction started its commit, that is, it is in state
TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
flushoncommit option or the transaction is creating a snapshot for the
subvolume that contains the file that fiemap is operating on, we end up
deadlocking. This happens because fiemap is blocked on the transaction,
waiting for it to complete, and the transaction is waiting for the flushed
dealloc to complete, which requires locking the file range that the fiemap
task already locked. The following stack traces serve as an example of
when this deadlock happens:

(...)
[404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[404571.515956] Call Trace:
[404571.516360] ? __schedule+0x3ae/0x7b0
[404571.516730] schedule+0x3a/0xb0
[404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs]
[404571.517465] ? remove_wait_queue+0x60/0x60
[404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs]
[404571.518202] normal_work_helper+0xea/0x530 [btrfs]
[404571.518566] process_one_work+0x21e/0x5c0
[404571.518990] worker_thread+0x4f/0x3b0
[404571.519413] ? process_one_work+0x5c0/0x5c0
[404571.519829] kthread+0x103/0x140
[404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70
[404571.520565] ret_from_fork+0x3a/0x50
[404571.520915] kworker/u8:6 D 0 31651 2 0x80004000
[404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
(...)
[404571.537000] fsstress D 0 13117 13115 0x00004000
[404571.537263] Call Trace:
[404571.537524] ? __schedule+0x3ae/0x7b0
[404571.537788] schedule+0x3a/0xb0
[404571.538066] wait_current_trans+0xc8/0x100 [btrfs]
[404571.538349] ? remove_wait_queue+0x60/0x60
[404571.538680] start_transaction+0x33c/0x500 [btrfs]
[404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs]
[404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs]
[404571.539866] extent_fiemap+0x2ce/0x650 [btrfs]
[404571.540170] do_vfs_ioctl+0x526/0x6f0
[404571.540436] ksys_ioctl+0x70/0x80
[404571.540734] __x64_sys_ioctl+0x16/0x20
[404571.540997] do_syscall_64+0x60/0x1d0
[404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[404571.543729] btrfs D 0 14210 14208 0x00004000
[404571.544023] Call Trace:
[404571.544275] ? __schedule+0x3ae/0x7b0
[404571.544526] ? wait_for_completion+0x112/0x1a0
[404571.544795] schedule+0x3a/0xb0
[404571.545064] schedule_timeout+0x1ff/0x390
[404571.545351] ? lock_acquire+0xa6/0x190
[404571.545638] ? wait_for_completion+0x49/0x1a0
[404571.545890] ? wait_for_completion+0x112/0x1a0
[404571.546228] wait_for_completion+0x131/0x1a0
[404571.546503] ? wake_up_q+0x70/0x70
[404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
[404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
[404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
[404571.547703] ? remove_wait_queue+0x60/0x60
[404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs]
[404571.548226] ? __sb_start_write+0xd4/0x1c0
[404571.548512] ? mnt_want_write_file+0x24/0x50
[404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
[404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
[404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs]
[404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0
[404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0
[404571.550064] ? __handle_mm_fault+0xe3e/0x11f0
[404571.550306] ? do_raw_spin_unlock+0x49/0xc0
[404571.550608] ? _raw_spin_unlock+0x24/0x30
[404571.550976] ? __handle_mm_fault+0xedf/0x11f0
[404571.551319] ? do_vfs_ioctl+0xa2/0x6f0
[404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[404571.552087] do_vfs_ioctl+0xa2/0x6f0
[404571.552355] ksys_ioctl+0x70/0x80
[404571.552621] __x64_sys_ioctl+0x16/0x20
[404571.552864] do_syscall_64+0x60/0x1d0
[404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)

If we were joining the transaction instead of attaching to it, we would
not risk a deadlock because a join only blocks if the transaction is in a
state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
flush performed by a transaction is done before it reaches that state,
when it is in the state TRANS_STATE_COMMIT_START. However a transaction
join is intended for use cases where we do modify the filesystem, and
fiemap only needs to peek at delayed references from the current
transaction in order to determine if extents are shared, and, besides
that, when there is no current transaction or when it blocks to wait for
a current committing transaction to complete, it creates a new transaction
without reserving any space. Such unnecessary transactions, besides doing
unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
rotation of the precious backup roots.

So fix this by adding a new transaction join variant, named join_nostart,
which behaves like the regular join, but it does not create a transaction
when none currently exists or after waiting for a committing transaction
to complete.

Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb6dea26 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate btrfs_trans_release_chunk_metadata

Move this into transaction.c with the rest of the transaction related
code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 74f657d8 15-Apr-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: remove no longer used member num_dirty_bgs from transaction

The member num_dirty_bgs of struct btrfs_transaction is not used anymore,
it is set and incremented but nothing reads its value anymore. Its last
read use was removed by commit 64403612b73a94 ("btrfs: rework
btrfs_check_space_for_delayed_refs"). So just remove that member.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1c11b63e 27-Mar-2019 Jeff Mahoney <jeffm@suse.com>

btrfs: replace pending/pinned chunks lists with io tree

The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bbbf7243 25-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: combine device update operations during transaction commit

We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.

We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3b1da515 11-Mar-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: remove no longer used 'sync' member from transaction handle

Commit db2462a6ad3d ("btrfs: don't run delayed refs in the end transaction
logic") removed its last use, so now it does absolutely nothing, therefore
remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bbe339cc 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: drop extra enum initialization where using defaults

The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 85dd506c 26-Oct-2018 Filipe Manana <fdmanana@suse.com>

Btrfs: remove no longer used stuff for tracking pending ordered extents

Tracking pending ordered extents per transaction was introduced in commit
50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current
transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change
how we wait for pending ordered extents").

However now that on fsync we always wait for ordered extents to complete
before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged
extents infrastructure"), we no longer need the stuff to track for pending
ordered extents, which was not completely removed in the mentioned commit.
So remove the remaining of the pending ordered extents infrastructure.

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a944442c 12-Jun-2018 Allen Pais <allen.lkml@gmail.com>

btrfs: replace get_seconds with new 64bit time API

The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b0cb1f9 29-May-2018 Gu JinXiang <gujx@cn.fujitsu.com>

btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot

Since there is no more use of qgroup_reserved member in struct
btrfs_pending_snapshot, remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a514d638 22-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT

Unlike previous method that tries to commit transaction inside
qgroup_reserve(), this time we will try to commit transaction using
fs_info->transaction_kthread to avoid nested transaction and no need to
worry about locking context.

Since it's an asynchronous function call and we won't wait for
transaction commit, unlike previous method, we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of qgroup meta_pertrans
reservation as indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned in transaction committ, it's
useless anyway)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9888c340 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- headers

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>


# bcf3a3e7 05-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove code referencing unused TRANS_USERSPACE

Now that the userspace transaction ioctls have been removed,
TRANS_USERSPACE is no longer used hence we can remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 45ae2c18 08-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Document consistency of transaction->io_bgs list

The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7806c6eb 14-Dec-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused btrfs_start_transaction_lflush function

Commit 0e8c36a9fd81 ("Btrfs: fix lots of orphan inodes when the space
is not enough") changed the way transaction reservation is made in
btrfs_evict_node and as a result this function became unused. This has
been the status quo for 5 years in which time no one noticed, so I'd
say it's safe to assume it's unlikely it will ever be used again.

Historical note: there were more attempts to remove the function, the
reasoning was missing and only based on some static analysis tool
reports. Other reason for rejection was that there seemed to be
connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be
removeed to. This was not correct so removing the function is all we can
do.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 5302e089 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: reorder btrfs_transaction members for better packing

There are now 20 bytes of holes, we can reduce that to 4 by minor
changes. Moving 'aborted' to the status and flags is also more logical,
similar for num_dirty_bgs. The size goes from 432 to 416.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 165c8b02 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: use narrower type for btrfs_transaction::num_dirty_bgs

The u64 is an overkill here, we could not possibly create that many
blockgroups in one transaction.

Signed-off-by: David Sterba <dsterba@suse.com>


# 1ca4bb63 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: reorder btrfs_trans_handle members for better packing

Recent updates to the structure left some holes, reorder the types so
the packing is tight. The size goes from 112 to 104 on 64bit.

Signed-off-by: David Sterba <dsterba@suse.com>


# b50fff81 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: switch to refcount_t type for btrfs_trans_handle::use_count

The use_count is a reference counter, we can use the refcount_t type,
though we don't use the atomicity. This is not a performance critical
code and we could catch the underflows. The type is changed from long,
but the number of references will fit an int.

Signed-off-by: David Sterba <dsterba@suse.com>


# 2dbda74e 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused member of btrfs_trans_handle

Last user was removed in a monster commit a22285a6a32390195235171
("Btrfs: Integrate metadata reservation with start_transaction") in
2010.

Signed-off-by: David Sterba <dsterba@suse.com>


# 7c2871a2 07-Nov-2017 David Sterba <dsterba@suse.com>

btrfs: switch btrfs_trans_handle::adding_csums to bool

The semantics of adding_csums matches bool, 'short' was most likely used
to save space in a698d0755adb6f2 ("Btrfs: add a type field for the
transaction handle").

Signed-off-by: David Sterba <dsterba@suse.com>


# f486135e 15-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused qgroup members from btrfs_trans_handle

The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562dec83b3), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9b64f57d 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3a45bb20 09-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: remove root parameter from transaction commit/end routines

Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bf89d38f 09-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: split btrfs_wait_marked_extents into normal and tree log functions

btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
btrfs_wait_marked_extents, which provides a core loop and then handles
errors differently based on whether it's it's a log root or not.

This means that btrfs_write_and_wait_marked_extents needs to take a root
because btrfs_wait_marked_extents requires one, even though it's only
used to determine whether the root is a log root. The log root code
won't ever call into the transaction commit code using a log root, so we
can factor out the core loop and provide the error handling appropriate
to each waiter in new routines. This allows us to eventually remove
the root argument from btrfs_commit_transaction, and as a result,
btrfs_end_transaction.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab8d0fc4 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert pr_* to btrfs_* where possible

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 64b63580 20-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: add btrfs_trans_handle->fs_info pointer

btrfs_trans_handle->root is documented as for use for confirming
that the root passed in to start the transaction is the same as the
one ending it. It's used in several places when an fs_info pointer
is needed, so let's just add an fs_info pointer directly. Eventually,
the root pointer can be removed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 64c12921 07-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: account for non-CoW'd blocks in btrfs_abort_transaction

The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor. btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction. trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.

In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion. In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 01327610 19-May-2016 Nicholas D Steeves <nsteeves@gmail.com>

btrfs: fix string and comment grammatical issues and typos

Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8546b570 10-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: preallocate path for snapshot creation at ioctl time

We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.

Signed-off-by: David Sterba <dsterba@suse.com>


# b0c0ea63 10-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: allocate root item at snapshot ioctl time

The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.

Signed-off-by: David Sterba <dsterba@suse.com>


# 348a0013 26-Nov-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list

As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
filesysten with "-o discard":

10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
10264 {
(...)
10405 if (trimming) {
10406 WARN_ON(!list_empty(&block_group->bg_list));
10407 spin_lock(&trans->transaction->deleted_bgs_lock);
10408 list_move(&block_group->bg_list,
10409 &trans->transaction->deleted_bgs);
10410 spin_unlock(&trans->transaction->deleted_bgs_lock);
10411 btrfs_get_block_group(block_group);
10412 }
(...)

This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).

The following diagram illustrates how this happens:

CPU 1 CPU 2

cleaner_kthread()
btrfs_delete_unused_bgs()

sees bg X in list
fs_info->unused_bgs

deletes bg X from list
fs_info->unused_bgs

scrub_enumerate_chunks()

searches device tree using
its commit root

finds device extent for
block group X

gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())

sets bg X to RO (again)

scrub_chunk(bg X)

sets bg X back to RW mode

adds bg X to the list
fs_info->unused_bgs again,
since it's still unused and
currently not in that list

sets bg X to RO mode

btrfs_remove_chunk(bg X)

--> discard is enabled and bg X
is in the fs_info->unused_bgs
list again so the warning is
triggered
--> we move it from that list into
the transaction's delete_bgs
list, but we can have another
task currently manipulating
the first list (fs_info->unused_bgs)

Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

Signed-off-by: Filipe Manana <fdmanana@suse.com>


# 8eab77ff 13-Nov-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: use global reserve when deleting unused block group after ENOSPC

It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2968b1f4 30-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: don't continue setting up space cache when enospc

If we hit ENOSPC when setting up a space cache don't bother setting up any of
the other space cache's in this transaction, it'll just induce unnecessary
latency. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 3204d33c 24-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: add a flags field to btrfs_transaction

I want to set some per transaction flags, so instead of adding yet another int
lets just convert the current two int indicators to flags and add a flags field
for future use. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 161c3549 24-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: change how we wait for pending ordered extents

We have a mechanism to make sure we don't lose updates for ordered extents that
were logged in the transaction that is currently running. We add the ordered
extent to a transaction list and then the transaction waits on all the ordered
extents in that list. However are substantially large file systems this list
can be extremely large, and can give us soft lockups, since the ordered extents
don't remove themselves from the list when they do complete.

To fix this we simply add a counter to the transaction that is incremented any
time we have a logged extent that needs to be completed in the current
transaction. Then when the ordered extent finally completes it decrements the
per transaction counter and wakes up the transaction if we are the last ones.
This will eliminate the softlockup. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7174109c 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Use new metadata reservation.

As we have the new metadata reservation functions, use them to replace
the old btrfs_qgroup_reserve() call for metadata.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5aed1dd8 22-Sep-2015 Alexandru Moise <00moses.alexander00@gmail.com>

btrfs: change num_items type from u64 to unsigned int

The value of num_items that start_transaction() ultimately
always takes is a small one, so a 64 bit integer is overkill.

Also change num_items for btrfs_start_transaction() and
btrfs_start_transaction_lflush() as well.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d9a0540a 03-Oct-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix deadlock when finalizing block group creation

Josef ran into a deadlock while a transaction handle was finalizing the
creation of its block groups, which produced the following trace:

[260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084
[260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
[260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
[260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
[260445.593137] Call Trace:
[260445.593145] [<ffffffff8175a437>] schedule+0x37/0x80
[260445.593189] [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
[260445.593197] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
[260445.593225] [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
[260445.593253] [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
[260445.593295] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
[260445.593324] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[260445.593351] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[260445.593394] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
[260445.593427] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
[260445.593459] [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
[260445.593491] [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
[260445.593524] [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
[260445.593532] [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
[260445.593564] [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
[260445.593597] [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
[260445.593626] [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
[260445.593654] [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[260445.593682] [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
[260445.593724] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
[260445.593752] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[260445.593830] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[260445.593905] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
[260445.593946] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
[260445.593990] [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
[260445.594042] [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
[260445.594089] [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
[260445.594115] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
[260445.594133] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
[260445.594149] [<ffffffff8123e35d>] do_fsync+0x3d/0x70
[260445.594169] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
[260445.594187] [<ffffffff8123e600>] SyS_fsync+0x10/0x20
[260445.594204] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

This happened because the same transaction handle created a large number
of block groups and while finalizing their creation (inserting new items
and updating existing items in the chunk and device trees) a new metadata
extent had to be allocated and no free space was found in the current
metadata block groups, which made find_free_extent() attempt to allocate
a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
ended up allocating a new system chunk too and exceeded the threshold
of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
final part of block group creation again (at
btrfs_create_pending_block_groups()) and attempt to lock again the root
of the chunk tree when it's already write locked by the same task.

Similarly we can deadlock on extent tree nodes/leafs if while we are
running delayed references we end up creating a new metadata block group
in order to allocate a new node/leaf for the extent tree (as part of
a CoW operation or growing the tree), as btrfs_create_pending_block_groups
inserts items into the extent tree as well. In this case we get the
following trace:

[14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084
[14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
[14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
[14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
[14242.773606] Call Trace:
[14242.773613] [<ffffffff8175a437>] schedule+0x37/0x80
[14242.773656] [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
[14242.773664] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
[14242.773692] [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
[14242.773720] [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
[14242.773750] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[14242.773758] [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
[14242.773786] [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
[14242.773818] [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
[14242.773850] [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
[14242.773934] [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
[14242.773998] [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
[14242.774041] [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
[14242.774078] [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
[14242.774118] [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[14242.774155] [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
[14242.774194] [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
[14242.774235] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
[14242.774274] [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[14242.774318] [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
[14242.774358] [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
[14242.774391] [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
[14242.774432] [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
[14242.774474] [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
[14242.774516] [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
[14242.774558] [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
[14242.774599] [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
[14242.774642] [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
[14242.774650] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
[14242.774657] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
[14242.774663] [<ffffffff8123e35d>] do_fsync+0x3d/0x70
[14242.774669] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
[14242.774675] [<ffffffff8123e600>] SyS_fsync+0x10/0x20
[14242.774681] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by never recursing into the finalization phase of block group
creation and making sure we never trigger the finalization of block group
creation while running delayed references.

Reported-by: Josef Bacik <jbacik@fb.com>
Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock")
Signed-off-by: Filipe Manana <fdmanana@suse.com>


# 2b9dbef2 15-Sep-2015 Josef Bacik <jbacik@fb.com>

Btrfs: keep dropped roots in cache until transaction commit

When dropping a snapshot we need to account for the qgroup changes. If we drop
the snapshot in all one go then the backref code will fail to find blocks from
the snapshot we dropped since it won't be able to find the root in the fs root
cache. This can lead to us failing to find refs from other roots that pointed
at blocks in the now deleted root. To handle this we need to not remove the fs
roots from the cache until after we process the qgroup operations. Do this by
adding dropped roots to a list on the transaction, and letting the transaction
remove the roots at the same time it drops the commit roots. This will keep all
of the backref searching code in sync properly, and fixes a problem Mark was
seeing with snapshot delete and qgroups. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e33e17ee 15-Jun-2015 Jeff Mahoney <jeffm@suse.com>

btrfs: add missing discards when unpinning extents with -o discard

When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit. As a result, any extents that
would be freed when the block group is removed aren't discarded. In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 9086db86 19-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.

This is used by later qgroup fix patches for snapshot.

As current snapshot accounting is done by btrfs_qgroup_inherit(), but
new extent oriented quota mechanism will account extent from
btrfs_copy_root() and other snapshot things, causing wrong result.

So add this ability to handle snapshot accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4fbcdf66 20-May-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix -ENOSPC when finishing block group creation

While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:

[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.

So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.

Fix this by doing proper reservation in the chunk block reserve.

The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1bbc621e 06-Apr-2015 Chris Mason <clm@fb.com>

Btrfs: allow block group cache writeout outside critical section in commit

We loop through all of the dirty block groups during commit and write
the free space cache. In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit. We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>


# cb723e49 18-Feb-2015 Josef Bacik <jbacik@fb.com>

Btrfs: reserve space for block groups

This changes our delayed refs calculations to include the space needed
to write back dirty block groups.

Signed-off-by: Chris Mason <clm@fb.com>


# 2f2ff0ee 20-Mar-2015 Filipe Manana <fdmanana@suse.com>

Btrfs: fix metadata inconsistencies after directory fsync

We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.

Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.

This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:

_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey

# Create our main test file and directory.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
mkdir $SCRATCH_MNT/mydir

# Make sure all metadata and data are durably persisted.
sync

# Add a hard link to 'foo' inside our test directory and fsync only the
# directory. The btrfs fsync implementation had a bug that caused the new
# directory entry to be visible after the fsync log replay but, the inode
# of our file remained with a link count of 1.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

# Add a few more links and new files.
# This is just to verify nothing breaks or gives incorrect results after the
# fsync log is replayed.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
$XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

# Add some subdirectories and new files and links to them. This is to verify
# that after fsyncing our top level directory 'mydir', all the subdirectories
# and their files/links are registered in the fsync log and exist after the
# fsync log is replayed.
mkdir -p $SCRATCH_MNT/mydir/x/y/z
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
touch $SCRATCH_MNT/mydir/x/y/z/qwerty

# Now fsync only our top directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

# And fsync now our new file named 'hello', just to verify later that it has
# the expected content and that the previous fsync on the directory 'mydir' had
# no bad influence on this fsync.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey

_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey

# Verify the content of our file 'foo' remains the same as before, 8192 bytes,
# all with the value 0xaa.
echo "File 'foo' content after log replay:"
od -t x1 $SCRATCH_MNT/foo

# Remove the first name of our inode. Because of the directory fsync bug, the
# inode's link count was 1 instead of 5, so removing the 'foo' name ended up
# deleting the inode and the other names became stale directory entries (still
# visible to applications). Attempting to remove or access the remaining
# dentries pointing to that inode resulted in stale file handle errors and
# made it impossible to remove the parent directories since it was impossible
# for them to become empty.
echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
rm -f $SCRATCH_MNT/foo

# Now verify that all files, links and directories created before fsyncing our
# directory exist after the fsync log was replayed.
[ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
[ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
[ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
[ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
[ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
echo "Link mydir/x/y/foo_y_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
echo "Link mydir/x/y/z/foo_z_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
echo "File mydir/x/y/z/qwerty is missing"

# We expect our file here to have a size of 64Kb and all the bytes having the
# value 0xff.
echo "file 'hello' content after log replay:"
od -t x1 $SCRATCH_MNT/hello

# Now remove all files/links, under our test directory 'mydir', and verify we
# can remove all the directories.
rm -f $SCRATCH_MNT/mydir/x/y/z/*
rmdir $SCRATCH_MNT/mydir/x/y/z
rm -f $SCRATCH_MNT/mydir/x/y/*
rmdir $SCRATCH_MNT/mydir/x/y
rmdir $SCRATCH_MNT/mydir/x
rm -f $SCRATCH_MNT/mydir/*
rmdir $SCRATCH_MNT/mydir

# An fsck, run by the fstests framework everytime a test finishes, also detected
# the inconsistency and printed the following error message:
#
# root 5 inode 257 errors 2001, no inode item, link count wrong
# unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
# unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

status=0
exit

The expected golden output for the test is:

wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 5
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000

Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:

wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 1
Link mydir/foo_2 is missing
Link mydir/foo_3 is missing
Link mydir/x/y/foo_y_link is missing
Link mydir/x/y/z/foo_z_link is missing
File mydir/x/y/z/qwerty is missing
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

Fsck, without this fix, also complains about the wrong link count:

root 5 inode 257 errors 2001, no inode item, link count wrong
unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

So fix this by logging the inodes that the dentries point to when
fsyncing a directory.

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 13212b54 11-Feb-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: Fix out-of-space bug

Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.

Steps to reproduce:
1: Create a single-dev btrfs fs with default option
2: Write a file into it to take up most fs space
3: Delete above file
4: Wait about 100s to let chunk removed
5: goto 2

Script is like following:
#!/bin/bash

# Recommend 1.2G space, too large disk will make test slow
DEV="/dev/sda16"
MNT="/mnt/tmp"

dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
echo "mkfs $DEV"
mkfs.btrfs -f "$DEV" >/dev/null || exit 2
echo "mount $DEV $MNT"
mount "$DEV" "$MNT" || exit 2

for ((loop_i = 0; loop_i < 20; loop_i++)); do
echo
echo "loop $loop_i"

echo "dd file..."
cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
"${cmd[@]}" 2>/dev/null || {
# NO_SPACE error triggered
echo "dd failed: ${cmd[*]}"
exit 1
}

echo "rm file..."
rm -f "$MNT"/file0 || exit 2

for ((i = 0; i < 10; i++)); do
df "$MNT" | tail -1
sleep 10
done
done

Reason:
It is triggered by commit: 47ab2a6c689913db23ccae38349714edf8365e0a
which is used to remove empty block groups automatically, but the
reason is not in that patch. Code before works well because btrfs
don't need to create and delete chunks so many times with high
complexity.
Above bug is caused by many reason, any of them can trigger it.

Reason1:
When we remove some continuous chunks but leave other chunks after,
these disk space should be used by chunk-recreating, but in current
code, only first create will successed.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason2:
contains_pending_extent() return wrong value in calculation.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason3:
btrfs_check_data_free_space() try to commit transaction and retry
allocating chunk when the first allocating failed, but space_info->full
is set in first allocating, and prevent second allocating in retry.
Fixed in this patch by clear space_info->full in commit transaction.

Tested for severial times by above script.

Changelog v3->v4:
use light weight int instead of atomic_t to record have_remove_bgs in
transaction, suggested by:
Josef Bacik <jbacik@fb.com>

Changelog v2->v3:
v2 fixed the bug by adding more commit-transaction, but we
only need to reclaim space when we are really have no space for
new chunk, noticed by:
Filipe David Manana <fdmanana@gmail.com>

Actually, our code already have this type of commit-and-retry,
we only need to make it working with removed-bgs.
v3 fixed the bug with above way.

Changelog v1->v2:
v1 will introduce a new bug when delete and create chunk in same disk
space in same transaction, noticed by:
Filipe David Manana <fdmanana@gmail.com>
V2 fix this bug by commit transaction after remove block grops.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ce93ec54 17-Nov-2014 Josef Bacik <jbacik@fb.com>

Btrfs: track dirty block groups on their own list

Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set. This function
can get called several times during a commit. So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time. Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups. This allows us to clean up how we write the
free space cache out so it is much cleaner. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 50d9aa99 21-Nov-2014 Josef Bacik <jbacik@fb.com>

Btrfs: make sure logged extents complete in the current transaction V3

Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 663dfbb0 12-Oct-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: deal with convert_extent_bit errors to avoid fs corruption

When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.

That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.

Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 572d9ab7 05-Feb-2014 David Sterba <dsterba@suse.cz>

btrfs: add support for processing pending changes

There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.

For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).

Signed-off-by: David Sterba <dsterba@suse.cz>


# 2755a0de 30-Jul-2014 David Sterba <dsterba@suse.cz>

btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB

Signed-off-by: David Sterba <dsterba@suse.cz>


# 8d875f95 12-Aug-2014 Chris Mason <clm@fb.com>

btrfs: disable strict file flushes for renames and truncates

Truncates and renames are often used to replace old versions of a file
with new versions. Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held. It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>


# faa2dbf0 07-May-2014 Josef Bacik <jbacik@fb.com>

Btrfs: add sanity tests for new qgroup accounting code

This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 9e351cc8 13-Mar-2014 Josef Bacik <jbacik@fb.com>

Btrfs: remove transaction from send

Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a26e8c9f 28-Mar-2014 Josef Bacik <jbacik@fb.com>

Btrfs: don't clear uptodate if the eb is under IO

So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.

Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5039eddc 15-Jan-2014 Josef Bacik <jbacik@fb.com>

Btrfs: make fsync latency less sucky

Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# a56dbd89 25-Dec-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: remove btrfs_end_transaction_dmeta()

Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
workers should not commit the transaction, instead, deal with the delayed
items as many as possible.

So we can remove btrfs_end_transaction_dmeta()

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 20dd2cbf 25-Sep-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix BUG_ON() casued by the reserved space migration

When we did space balance and snapshot creation at the same time, we might
meet the following oops:
kernel BUG at fs/btrfs/inode.c:3038!
[SNIP]
Call Trace:
[<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
[<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
[<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
[<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
[<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
[<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
[<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
[<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
[<ffffffff81121e88>] SyS_ioctl+0x57/0x83
[<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
[<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]

The reason of the problem is that the relocation root creation stole
the reserved space, which was reserved for orphan item deletion.

There are several ways to fix this problem, one is to increasing
the reserved space size of the space balace, and then we can use
that space to create the relocation tree for each fs/file trees.
But it is hard to calculate the suitable size because we doesn't
know how many fs/file trees we need relocate.

We fixed this problem by reserving the space for relocation root creation
actively since the space it need is very small (one tree block, used for
root node copy), then we use that reserved space to create the
relocation tree. If we don't reserve space for relocation tree creation,
we will use the reserved space of the balance.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 724e2315 30-Sep-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix two use-after-free bugs with transaction cleanup

I was noticing the slab redzone stuff going off every once and a while during
transaction aborts. This was caused by two things

1) We would walk the pending snapshots and set their error to -ECANCELED. We
don't need to do this, the snapshot stuff waits for a transaction commit and if
there is a problem we just free our pending snapshot object and exit. Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.

2) We were freeing the transaction manually with wanton disregard for it's
use_count reference counter. To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
oursevles.

I also moved the global "kill all things dirty everywhere" stuff outside of the
transaction cleanup loop since that only needs to be done once. With this patch
I'm no longer seeing slab corruption because of use after frees. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 171170c1 14-Aug-2013 Sergei Trofimovich <slyfox@gentoo.org>

btrfs: mark some local function as 'static'

Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# cfad392b 25-Jul-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: check to see if root_list is empty before adding it to dead roots

A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice. To fix this check to see if we are empty before adding ourselves to the
dead roots list. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 6df9a95e 27-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: make the chunk allocator completely tree lockless

When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock. To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff. We still keep the in-memory
stuff which makes sure everything is consistent.

One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice. This has the side
effect of fixing a bug with balance that has been there ever since balance
existed. Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction. So if you happen to crash
during a balance you could come back to a completely broken file system. This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits. This has passed all of the xfstests and my super annoying stress test
followed by a balance. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 4a9d8bde 16-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make the state of the transaction more readable

We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.

This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.

In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.

Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 3f1e3fa6 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure

We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.

Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).

So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 0860adfd 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't wait for all the writers circularly during the transaction commit

btrfs_commit_transaction has the following loop before we commit the
transaction.

do {
// attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
(should_grow && cur_trans->num_joined != joined));

This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.

In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 48a3b636 25-Apr-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 9d1a2a3a 12-Mar-2013 David Sterba <dsterba@suse.cz>

btrfs: clean snapshots one by one

Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.

A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.

The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d5c12070 28-Feb-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix wrong reserved space in qgroup during snap/subv creation

There are two problems in the space reservation of the snapshot/
subvolume creation.
- don't reserve the space for the root item insertion
- the space which is reserved in the qgroup is different with
the free space reservation. we need reserve free space for
7 items, but in qgroup reservation, we need reserve space only
for 3 items.

So we implement new metadata reservation functions for the
snapshot/subvolume creation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# e9662f70 28-Feb-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot

Since we have grabbed the parent inode at the beginning of the
snapshot creation, and both sync and async snapshot creation
release it after the pending snapshots are actually created,
it is safe to access the parent inode directly during the snapshot
creation, we needn't use dget_parent/dput to fix the parent dentry
and get the dir inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# d4edf39b 20-Feb-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix uncompleted transaction

In some cases, we need commit the current transaction, but don't want
to start a new one if there is no running transaction, so we introduce
the function - btrfs_attach_transaction(), which can catch the current
transaction, and return -ENOENT if there is no running transaction.

But no running transaction doesn't mean the current transction completely,
because we removed the running transaction before it completes. In some
cases, it doesn't matter. But in some special cases, such as freeze fs, we
hope the transaction is fully on disk, it will introduce some bugs, for
example, we may feeze the fs and dump the data in the disk, if the transction
doesn't complete, we would dump inconsistent data. So we need fix the above
problem for those cases.

We fixes this problem by introducing a function:
btrfs_attach_transaction_barrier()
if we hope all the transaction is fully on the disk, even they are not
running, we can use this function.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 569e0f35 13-Feb-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: place ordered operations on a per transaction list

Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# de78b51a 31-Jan-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: remove cache only arguments from defrag path

The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell. Chris says they're dead code, so remove them.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# c6b305a8 18-Dec-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: don't re-enter when allocating a chunk

If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry. The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry. Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 08e007d2 16-Oct-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: improve the noflush reservation

In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.

We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 354aa0fb 20-Sep-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix orphan transaction on the freezed filesystem

With the following debug patch:

static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}

I found there was a orphan transaction after the freeze operation was done.

It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.

So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.

This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.

Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# a698d075 20-Sep-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: add a type field for the transaction handle

This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 60376ce4 14-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: fix race in sync and freeze again

I screwed this up, there is a race between checking if there is a running
transaction and actually starting a transaction in sync where we could race
with a freezer and get ourselves into trouble. To fix this we need to make
a new join type to only do the try lock on the freeze stuff. If it fails
we'll return EPERM and just return from sync. This fixes a hang Liu Bo
reported when running xfstest 68 in a loop. Thanks,

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# ea658bad 11-Sep-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: delay block group item insertion

So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 8407aa46 07-Sep-2012 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix corrupted metadata in the snapshot

When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.

We fix this problem by updating the inode item before the current transaction ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 46d8bc34 29-Aug-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix a bug in checking whether a inode is already in log

This is based on Josef's "Btrfs: turbo charge fsync".

The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].

But the problem is that this root->last_log_commit is shared among
inodes.

Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.

This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.

[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>


# 0e721106 26-Jun-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: change how we indicate we're adding csums

There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv. Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work. The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit. By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often. This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 6f72c7e2 14-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: add qgroup inheritance

When creating a subvolume or snapshot, it is necessary
to initialize the qgroup account with a copy of some
other (tracking) qgroup. This patch adds parameters
to the ioctls to pass the information from which qgroup
to inherit.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# c5567237 14-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: hooks to reserve qgroup space

Like block reserves, reserve a small piece of space on each
transaction start and for delalloc. These are the hooks that
can actually return EDQUOT to the user.
The amount of space reserved is tracked in the transaction
handle.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# bed92eae 28-Jun-2012 Arne Jansen <sensille@gmx.net>

Btrfs: qgroup implementation and prototypes

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# d13603ef 13-Sep-2011 Arne Jansen <sensille@gmx.net>

Btrfs: check the root passed to btrfs_end_transaction

This patch only add a consistancy check to validate that the
same root is passed to start_transaction and end_transaction.
Subvolume quota depends on this.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# 49b25e05 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: enhance transaction abort infrastructure

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# d82a6f1d 11-May-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill BTRFS_I(inode)->block_group

Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull. In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# a4abeea4 11-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: kill trans_mutex

We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 2a1eb461 13-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: if we've already started a trans handle, use that one

We currently track trans handles in current->journal_info, but we don't actually
use it. This patch fixes it. This will cover the case where we have multiple
people starting transactions down the call chain. This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return. I tested this with xfstests
and it worked out fine. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 7a7eaa40 12-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: take away the num_items argument from btrfs_join_transaction

I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# 16cdcec7 22-Apr-2011 Miao Xie <miaox@cn.fujitsu.com>

btrfs: implement delayed inode items operation

Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030

After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 621496f4 03-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove unused function prototypes

function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>


# 13c5a93e 11-Apr-2011 Josef Bacik <josef@redhat.com>

Btrfs: avoid taking the trans_mutex in btrfs_end_transaction

I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction. So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction. Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# b83cc969 20-Dec-2010 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: Add readonly snapshots support

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).

Implementation:

- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

Changelog for v3:

- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 46204592 29-Oct-2010 Sage Weil <sage@newdream.net>

Btrfs: add START_SYNC, WAIT_SYNC ioctls

START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# bb9c12c9 29-Oct-2010 Sage Weil <sage@newdream.net>

Btrfs: async transaction commit

Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0af3d00b 21-Jun-2010 Josef Bacik <josef@redhat.com>

Btrfs: create special free space cache inode

In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>


# 8929ecfa 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Introduce global metadata reservation

Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a22285a6 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Integrate metadata reservation with start_transaction

Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f0486c68 16-May-2010 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Introduce contexts for metadata reservation

Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8cef4e16 12-Nov-2009 Yan, Zheng <zheng.yan@oracle.com>

Btrfs: Avoid superfluous tree-log writeout

We allow two log transactions at a time, but use same flag
to mark dirty tree-log btree blocks. So we may flush dirty
blocks belonging to newer log transaction when committing a
log transaction. This patch fixes the issue by using two
flags to mark dirty tree-log btree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 690587d1 13-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: streamline tree-log btree block writeout

Syncing the tree log is a 3 phase operation.

1) write and wait for all the tree log blocks for a given root.

2) write and wait for all the tree log blocks for the
tree of tree log roots.

3) write and wait for the super blocks (barriers here)

This isn't as efficient as it could be because there is
no requirement to wait for the blocks from step one to hit the disk
before we start writing the blocks from step two. This commit
changes the sequence so that we don't start waiting until
all the tree blocks from both steps one and two have been sent
to disk.

We do this by breaking up btrfs_write_wait_marked_extents into
two functions, which is trivial because it was already broken
up into two parts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 257c62e1 13-Oct-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: avoid tree log commit when there are no changes

rpm has a habit of running fdatasync when the file hasn't
changed. We already detect if a file hasn't been changed
in the current transaction but it might have been sent to
the tree-log in this transaction and not changed since
the last call to fsync.

In this case, we want to avoid a tree log sync, which includes
a number of synchronous writes and barriers. This commit
extends the existing tracking of the last transaction to change
a file to also track the last sub-transaction.

The end result is that rpm -ivh and -Uvh are roughly twice as fast,
and on par with ext3.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f36f3042 30-Jul-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: be more polite in the async caching threads

The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall. This
releases the semaphore more often when a transaction commit is
in progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5d4f98a2 10-Jun-2009 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b7ec40d7 12-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: reduce stalls during transaction commit

To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.

This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait. This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.

It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish. This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 56bec294 13-Mar-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: do extent allocation and reference count updates in the background

The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d397712b 05-Jan-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix checkpatch.pl warnings

There were many, most are fixed now. struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d2fb3437 11-Dec-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: fix leaking block group on balance

The block group structs are referenced in many different
places, and it's not safe to free while balancing. So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 3de4586c 17-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Allow subvolumes and snapshots anywhere in the directory tree

Before, all snapshots and subvolumes lived in a single flat directory. This
was awkward and confusing because the single flat directory was only writable
with the ioctls.

This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree. This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.

The subvol ioctl does:

btrfsctl -S subvol_name parent_dir

After the ioctl is done subvol_name lives inside parent_dir.

The snapshot ioctl does:

btrfsctl -s path_for_snapshot root_to_snapshot

path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
into directory and basename components.

root_to_snapshot can be any file or directory in the FS. The snapshot
is taken of the entire root where that file lives.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d0c803c4 11-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Record dirty pages tree-log pages in an extent_io tree

This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk. There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e02119d5 05-Sep-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a write ahead tree log to optimize synchronous operations

File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b48652c1 04-Aug-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Various small fixes.

This trivial patch contains two locking fixes and a off by one fix.

---

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 9ca9ee09 04-Aug-2008 Sage Weil <sage@newdream.net>

Btrfs: fix ioctl-initiated transactions vs wait_current_trans()

Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
progress) breaks the transaction start/stop ioctls by making
btrfs_start_transaction conditionally wait for the next transaction to
start. If an application artificially is holding a transaction open,
things deadlock.

This workaround maintains a count of open ioctl-initiated transactions in
fs_info, and avoids wait_current_trans() if any are currently open (in
start_transaction() and btrfs_throttle()). The start transaction ioctl
uses a new btrfs_start_ioctl_transaction() that _does_ call
wait_current_trans(), effectively pushing the join/wait decision to the
outer ioctl-initiated transaction.

This more or less neuters btrfs_throttle() when ioctl-initiated
transactions are in use, but that seems like a pretty fundamental
consequence of wrapping lots of write()'s in a transaction. Btrfs has no
way to tell if the application considers a given operation as part of it's
transaction.

Obviously, if the transaction start/stop ioctls aren't being used, there
is no effect on current behavior.

Signed-off-by: Sage Weil <sage@newdream.net>
---
ctree.h | 1 +
ioctl.c | 12 +++++++++++-
transaction.c | 18 +++++++++++++-----
transaction.h | 2 ++
4 files changed, 27 insertions(+), 6 deletions(-)

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f321e491 30-Jul-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Update and fix mount -o nodatacow

To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.

We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ab78c84d 29-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Throttle operations if the reference cache gets too large

A large reference cache is directly related to a lot of work pending
for the cleaner thread. This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f9295749 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

btrfs_start_transaction: wait for commits in progress to finish

btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed. btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.

There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction. For them, btrfs_join_transaction
is added.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e6dcd2dc 16-Jul-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: New data=ordered implementation

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3f157a2f 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Online btree defragmentation fixes

The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a74a4b97 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Replace the transaction work queue with kthreads

This creates one kthread for commits and one kthread for
deleting old snapshots. All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 89ce8a63 25-Jun-2008 Chris Mason <chris.mason@oracle.com>

Add btrfs_end_transaction_throttle to force writers to wait for pending commits

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added. This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d1310b2e 24-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Split the extent_map code into two parts

There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 3063d29f 08-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Move snapshot creation to commit time

It is very difficult to create a consistent snapshot of the btree when
other writers may update the btree before the commit is done.

This changes the snapshot creation to happen during the commit, while
no other updates are possible.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dc17ff8f 08-Jan-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add data=ordered support

This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6da6abae 18-Dec-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Back port to 2.6.18-el kernels

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5f39d397 15-Oct-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Create extent_buffer interface for large blocksizes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5ce14bbc 11-Sep-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Find and remove dead roots the first time a root is loaded.

Dead roots are trees left over after a crash, and they were either in the
process of being removed or were waiting to be removed when the box crashed.
Before, a search of the entire tree of root pointers was done on mount
looking for dead roots. Now, the search is done the first time we load
a root.

This makes mount faster when there are a large number of snapshots, and it
enables the block accounting code to properly update the block counts on
the latest root as old versions of the root are reaped after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 15ee9bc7 10-Aug-2007 Josef Bacik <jwhiter@redhat.com>

Btrfs: delay commits during fsync to allow more writers

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e9d0b13b 10-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Btree defrag on the extent-mapping tree as well

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 26b8003f 08-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Replace extent tree preallocation code with some bit radix magic.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6702ed49 07-Aug-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add run time btree defrag, and an ioctl to force btree defrag

This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 5eda7b5e 22-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: Add the ability to find and remove dead roots after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6cbd5570 12-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add GPLv2

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# facda1e7 08-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: get forced transaction commits via workqueue

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 08607c1b 08-Jun-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: add compat ioctl

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 31f3c99b 30-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: allocator improvements, inode block groups

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 7c4452b9 28-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: smarter transaction writeback

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8fd17795 19-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: early fsync support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2c90e5d6 02-Apr-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: still corruption hunting

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 79154b1b 22-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: transaction rework

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e20d96d6 21-Mar-2007 Chris Mason <chris.mason@oracle.com>

Mountable btrfs, with readdir

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e089f05c 16-Mar-2007 Chris Mason <chris.mason@oracle.com>

Btrfs: transaction handles everywhere

Signed-off-by: Chris Mason <chris.mason@oracle.com>