History log of /linux-master/fs/btrfs/qgroup.h
Revision Date Author Comments
# 86211eea 26-Feb-2024 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: validate btrfs_qgroup_inherit parameter

[BUG]
Currently btrfs can create subvolume with an invalid qgroup inherit
without triggering any error:

# mkfs.btrfs -O quota -f $dev
# mount $dev $mnt
# btrfs subvolume create -i 2/0 $mnt/subv1
# btrfs qgroup show -prce --sync $mnt
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 16.00KiB 16.00KiB subv1

[CAUSE]
We only do a very basic size check for btrfs_qgroup_inherit structure,
but never really verify if the values are correct.

Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
qgroups, and never return any error.

[FIX]
Fix the behavior and introduce extra checks:

- Introduce early check for btrfs_qgroup_inherit structure
Not only the size, but also all the qgroup ids would be verified.

And the timing is very early, so we can return error early.
This early check is very important for snapshot creation, as snapshot
is delayed to transaction commit.

- Drop support for btrfs_qgroup_inherit::num_ref_copies and
num_excl_copies
Those two members are used to specify to copy refr/excl numbers from
other qgroups.
This would definitely mark qgroup inconsistent, and btrfs-progs has
dropped the support for them for a long time.
It's time to drop the support for kernel.

- Verify the supported btrfs_qgroup_inherit::flags
Just in case we want to add extra flags for btrfs_qgroup_inherit.

Now above subvolume creation would fail with -ENOENT other than silently
ignore the non-existing qgroup.

CC: stable@vger.kernel.org # 6.7+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5693a128 26-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: add forward declarations and headers, part 3

Do a cleanup in the rest of the headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>


# e85a0ada 01-Dec-2023 Boris Burkov <boris@bur.io>

btrfs: ensure releasing squota reserve on head refs

A reservation goes through a 3 step lifetime:

- generated during delalloc
- released/counted by ordered_extent allocation
- freed by running delayed ref

That third step depends on must_insert_reserved on the head ref, so the
head ref with that field set owns the reservation. Once you prepare to
run the head ref, must_insert_reserved is unset, which means that
running the ref must free the reservation, whether or not it succeeds,
or else the reservation is leaked. That results in either a risk of
spurious ENOSPC if the fs stays writeable or a warning on unmount if it
is readonly.

The existing squota code was aware of these invariants, but missed a few
cases. Improve it by adding a helper function to use in the cleanup
paths and call it from the existing early returns in running delayed
refs. This also simplifies btrfs_record_squota_delta and struct
btrfs_quota_delta.

This fixes (or at least improves the reliability of) generic/475 with
"mkfs -O squota". On my machine, that test failed ~4/10 times without
this patch and passed 100/100 times with it.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9e65bfca 01-Dec-2023 Boris Burkov <boris@bur.io>

btrfs: fix qgroup_free_reserved_data int overflow

The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.

Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.

Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bd7c1ea3 28-Mar-2023 Boris Burkov <boris@bur.io>

btrfs: qgroup: check generation when recording simple quota delta

Simple quotas count extents only from the moment the feature is enabled.
Therefore, if we do something like:

1. create subvol S
2. write F in S
3. enable quotas
4. remove F
5. write G in S

then after 3. and 4. we would expect the simple quota usage of S to be 0
(putting aside some metadata extents that might be written) and after
5., it should be the size of G plus metadata. Therefore, we need to be
able to determine whether a particular quota delta we are processing
predates simple quota enablement.

To do this, store the transaction id when quotas were enabled. In
fs_info for immediate use and in the quota status item to make it
recoverable on mount. When we see a delta, check if the generation of
the extent item is less than that of quota enablement. If so, we should
ignore the delta from this extent.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5343cd93 28-Mar-2023 Boris Burkov <boris@bur.io>

btrfs: qgroup: simple quota auto hierarchy for nested subvolumes

Consider the following sequence:

- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/

With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.

Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.

In our example, 257/0 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.

I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1e0e9d57 28-Mar-2023 Boris Burkov <boris@bur.io>

btrfs: add helper for recording simple quota deltas

Rather than re-computing shared/exclusive ownership based on backrefs
and walking roots for implicit backrefs, simple quotas does an increment
when creating an extent and a decrement when deleting it. Add the API
for the extent item code to use to track those events.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 182940f4 16-May-2023 Boris Burkov <boris@bur.io>

btrfs: qgroup: add new quota mode for simple quotas

Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b0cd63b 16-May-2023 Boris Burkov <boris@bur.io>

btrfs: qgroup: introduce quota mode

In preparation for introducing simple quotas, change from a binary
setting for quotas to an enum based mode. Initially, the possible modes
are disabled/full. Full quotas is normal btrfs qgroups.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33b6b251 07-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: move functions comments from qgroup.h to qgroup.c

We keep the comments next to the implementation, there were some left
to move.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# dce28769 01-Sep-2023 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()

The ulist @qgroups is utilized to record all involved qgroups from both
old and new roots inside btrfs_qgroup_account_extent().

Due to the fact that qgroup_update_refcnt() itself is already utilizing
qgroup_iterator, here we have to introduce another list_head,
btrfs_qgroup::nested_iterator, allowing nested iteration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 686c4a5a 01-Sep-2023 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()

Qgroup heavily relies on ulist to go through all the involved
qgroups, but since we're using ulist inside fs_info->qgroup_lock
spinlock, this means we're doing a lot of GFP_ATOMIC allocations.

This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by
eliminating the memory allocation completely.

This is done by moving the needed memory to btrfs_qgroup::iterator
list_head, so that we can put all the involved qgroup into a on-stack
list, thus eliminating the need to allocate memory while holding
spinlock.

The only cost is the slightly higher memory usage, but considering the
reduce GFP_ATOMIC during a hot path, it should still be acceptable.

Function qgroup_reserve() is the perfect start point for this
conversion.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e2896e79 14-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extent

All callers pass GFP_NOFS, we can drop the parameter and use it
directly.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e0a8b9a7 09-Sep-2022 David Sterba <dsterba@suse.com>

btrfs: convert QGROUP_* defines to enum bits

The defines/enums are used only for tracepoints and are not part of the
on-disk format.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e15e9f43 23-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting

The new flag will make btrfs qgroup skip all its time consuming
qgroup accounting.

The lifespan is the same as BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN,
only get cleared after a new rescan.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e562a8bd 23-Aug-2022 Qu Wenruo <wqu@suse.com>

btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN

Introduce a new runtime flag, BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN,
which will inform qgroup rescan to cancel its work asynchronously.

This is to address the window when an operation makes qgroup numbers
inconsistent (like qgroup inheriting) while a qgroup rescan is running.

In that case, qgroup inconsistent flag will be cleared when qgroup
rescan finishes.
But we changed the ownership of some extents, which means the rescan is
already meaningless, and the qgroup inconsistent flag should not be
cleared.

With the new flag, each time we set INCONSISTENT flag, we also set this
new flag to inform any running qgroup rescan to exit immediately, and
leaving the INCONSISTENT flag there.

The new runtime flag can only be cleared when a new rescan is started.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d4135134 23-Mar-2022 Filipe Manana <fdmanana@suse.com>

btrfs: avoid blocking on space revervation when doing nowait dio writes

When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.

So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.

This is part of a patchset comprised of the following patches:

btrfs: avoid blocking on page locks with nowait dio on compressed range
btrfs: avoid blocking nowait dio when locking file range
btrfs: avoid double nocow check when doing nowait dio writes
btrfs: stop allocating a path when checking if cross reference exists
btrfs: free path at can_nocow_extent() before checking for checksum items
btrfs: release path earlier at can_nocow_extent()
btrfs: avoid blocking when allocating context for nowait dio read/write
btrfs: avoid blocking on space revervation when doing nowait dio writes

The following test was run before and after applying this patchset:

$ cat io-uring-nodatacow-test.sh
#!/bin/bash

DEV=/dev/sdc
MNT=/mnt/sdc

MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"

NUM_JOBS=4
FILE_SIZE=8G
RUN_TIME=300

cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randrw
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF

echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT

fio /tmp/fio-job.ini

umount $MNT

The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.

Result before the patchset:

READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec

Result after the patchset:

READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec

That's about +7.2% throughput for reads and +6.9% for writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8949b9a1 21-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix lock inversion problem when doing qgroup extent tracing

At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
NULL value as the transaction handle argument, which makes that function
take the commit_root_sem semaphore, which is necessary when we don't hold
a transaction handle or any other mechanism to prevent a transaction
commit from wiping out commit roots.

However btrfs_qgroup_trace_extent_post() can be called in a context where
we are holding a write lock on an extent buffer from a subvolume tree,
namely from btrfs_truncate_inode_items(), called either during truncate
or unlink operations. In this case we end up with a lock inversion problem
because the commit_root_sem is a higher level lock, always supposed to be
acquired before locking any extent buffer.

Lockdep detects this lock inversion problem since we switched the extent
buffer locks from custom locks to semaphores, and when running btrfs/158
from fstests, it reported the following trace:

[ 9057.626435] ======================================================
[ 9057.627541] WARNING: possible circular locking dependency detected
[ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
[ 9057.628961] ------------------------------------------------------
[ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
[ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.632542]
but task is already holding lock:
[ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.635255]
which lock already depends on the new lock.

[ 9057.636292]
the existing dependency chain (in reverse order) is:
[ 9057.637240]
-> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
[ 9057.638138] down_read+0x46/0x140
[ 9057.638648] btrfs_find_all_roots+0x41/0x80 [btrfs]
[ 9057.639398] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
[ 9057.640283] btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
[ 9057.641114] btrfs_free_extent+0x35/0xb0 [btrfs]
[ 9057.641819] btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
[ 9057.642643] btrfs_evict_inode+0x454/0x4f0 [btrfs]
[ 9057.643418] evict+0xcf/0x1d0
[ 9057.643895] do_unlinkat+0x1e9/0x300
[ 9057.644525] do_syscall_64+0x3b/0xc0
[ 9057.645110] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9057.645835]
-> #0 (btrfs-tree-00){++++}-{3:3}:
[ 9057.646600] __lock_acquire+0x130e/0x2210
[ 9057.647248] lock_acquire+0xd7/0x310
[ 9057.647773] down_read_nested+0x4b/0x140
[ 9057.648350] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.649175] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.650010] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.650849] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.651733] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.652501] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.653264] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.654295] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.655111] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.655831] process_one_work+0x247/0x5a0
[ 9057.656425] worker_thread+0x55/0x3c0
[ 9057.656993] kthread+0x155/0x180
[ 9057.657494] ret_from_fork+0x22/0x30
[ 9057.658030]
other info that might help us debug this:

[ 9057.659064] Possible unsafe locking scenario:

[ 9057.659824] CPU0 CPU1
[ 9057.660402] ---- ----
[ 9057.660988] lock(&fs_info->commit_root_sem);
[ 9057.661581] lock(btrfs-tree-00);
[ 9057.662348] lock(&fs_info->commit_root_sem);
[ 9057.663254] lock(btrfs-tree-00);
[ 9057.663690]
*** DEADLOCK ***

[ 9057.664437] 4 locks held by kworker/u16:4/30781:
[ 9057.665023] #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.666260] #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.667639] #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
[ 9057.669017] #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.670408]
stack backtrace:
[ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
[ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[ 9057.674258] Call Trace:
[ 9057.674588] dump_stack_lvl+0x57/0x72
[ 9057.675083] check_noncircular+0xf3/0x110
[ 9057.675611] __lock_acquire+0x130e/0x2210
[ 9057.676132] lock_acquire+0xd7/0x310
[ 9057.676605] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.677313] ? lock_is_held_type+0xe8/0x140
[ 9057.677849] down_read_nested+0x4b/0x140
[ 9057.678349] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679068] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679760] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.680458] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.681083] ? _raw_spin_unlock+0x29/0x40
[ 9057.681594] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.682336] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.683058] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.683834] ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
[ 9057.684632] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.685316] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.685977] ? ___ratelimit+0xa4/0x110
[ 9057.686460] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.687316] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.688021] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.688649] ? lock_is_held_type+0xe8/0x140
[ 9057.689180] process_one_work+0x247/0x5a0
[ 9057.689696] worker_thread+0x55/0x3c0
[ 9057.690175] ? process_one_work+0x5a0/0x5a0
[ 9057.690731] kthread+0x155/0x180
[ 9057.691158] ? set_kthread_struct+0x40/0x40
[ 9057.691697] ret_from_fork+0x22/0x30

Fix this by making btrfs_find_all_roots() never attempt to lock the
commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().

We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
from btrfs_qgroup_trace_extent_post(), because that would make backref
lookup not use commit roots and acquire read locks on extent buffers, and
therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
from the btrfs_truncate_inode_items() code path which has acquired a write
lock on an extent buffer of the subvolume btree.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 80e9baed 22-Feb-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: export and rename qgroup_reserve_meta

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 49e5fb46 27-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: export qgroups in sysfs

This patch will add the following sysfs interface:

/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags

Which is also available in output of "btrfs qgroup show".

/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc

The last 3 rsv related members are not visible to users, but can be very
useful to debug qgroup limit related bugs.

Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
level and qgroup id is changed to '_'.

The interface is not hidden behind 'debug' as we want this interface to
be included into production build and to provide another way to read the
qgroup information besides the ioctls.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cfdd4592 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inode

vfs_inode is used only for the inode number everything else requires
btrfs_inode.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use btrfs_ino ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 7661a3e0 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_qgroup_reserve_data take btrfs_inode

There's only a single use of vfs_inode in a tracepoint so let's take
btrfs_inode directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 72b7d15b 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_qgroup_release_data take btrfs_inode

It just forwards its argument to __btrfs_qgroup_release_data.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8b8a979f 02-Jun-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_qgroup_free_data take btrfs_inode

It passes btrfs_inode to its callee so change the interface.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5958253c 09-Jun-2020 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: catch reserved space leaks at unmount time

Before this patch, qgroup completely relies on per-inode extent io tree
to detect reserved data space leak.

However previous bug has already shown how release page before
btrfs_finish_ordered_io() could lead to leak, and since it's
QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be
detected by per-inode extent io tree.

So this patch adds another (and hopefully the final) safety net to catch
qgroup data reserved space leak. At least the new safety net catches
all the leaks during development, so it should be pretty useful in the
real world.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81f7eb00 11-Feb-2020 Jeff Mahoney <jeffm@suse.com>

btrfs: destroy qgroup extent records on transaction abort

We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.

This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.

Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 32da5386 29-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1418bae1 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record

[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.

Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.

The simplified reproducer will be: (needs a 2G ram VM)

$ mkfs.btrfs -f $dev
$ mount $dev $mnt

$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv

$ for i in $(seq -w 1 8); do
xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
echo "file $i written" > /dev/kmsg
done
$ sync
$ btrfs qgroup show -pcre --raw $mnt

The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:

qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16384 16384 none none --- ---
0/256 1073758208 1073758208 none 1073741824 --- ---

And 1073758208 is larger than
> 1073741824.

[CAUSE]
It's a bug in btrfs qgroup data reserved space management.

For quota limit, we must ensure that:
reserved (data + metadata) + rfer/excl <= limit

Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.

One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.

Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.

However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.

The related events will be something like:

file 1 written
btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
cleanup_ref_head: num_bytes=54947840
cleanup_ref_head: num_bytes=5636096
cleanup_ref_head: num_bytes=569344
cleanup_ref_head: num_bytes=57344
cleanup_ref_head: num_bytes=8192
^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
file 2 written
...
file 8 written
cleanup_ref_head: num_bytes=8192
...
btrfs_commit_transaction <<< the only transaction committed during
the test

When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.

This allows us to write more data beyond qgroup limit.

In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.

[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.

Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9627736b 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Cleanup old subtree swap code

Since it's replaced by new delayed subtree swap code, remove the
original code.

The cleanup is small since most of its core function is still used by
delayed subtree swap trace.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f616f5cd 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Use delayed subtree rescan for balance

Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.

This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.

However for subtree swap of balance, just swap both subtrees because
they contain the same contents and tree structure, so qgroup numbers
won't change.

It's the race window between subtree swap and transaction commit could
cause qgroup number change.

This patch will delay the qgroup subtree scan until COW happens for the
subtree root.

So if there is no other operations for the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.

Only for worst case scenario, it will fall back to old subtree swap
overhead. (scan all swapped subtrees)

[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.

Mkfs parameter:
--nodesize 4K (To bump up tree size)

Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)

Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.

balance parameter:
-m

So the content should be pretty similar to a real world root fs layout.

And after file system population, there is no other activity, so it
should be the best case scenario.

| v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents | 22615 | 22457 | -0.1%
qgroup dirty extents | 163457 | 121606 | -25.6%
time (sys) | 22.884s | 18.842s | -17.6%
time (real) | 27.724s | 22.884s | -17.5%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 370a11b8 23-Jan-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Introduce per-root swapped blocks infrastructure

To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.

The designed workflow will be:

1) Record the subtree root block that gets swapped.

During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF

In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.

2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF

3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.

3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).

Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.

4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.

This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 52042d8e 27-Nov-2018 Andrea Gelmini <andrea.gelmini@gelma.net>

btrfs: Fix typos in comments and strings

The typos accumulate over time so once in a while time they get fixed in
a large patch.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bbe339cc 27-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: drop extra enum initialization where using defaults

The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3628b4ca 09-Oct-2018 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled

Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.

This is caused by the lack of qgroup status check before calling some
qgroup functions. Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.

This patch will do earlier check before triggering related trace events.

And for enabled <-> disabled race case:

1) For enabled->disabled case
Disable will wipe out all qgroups data including reservation and
excl/rfer. Even if we leak some reservation or numbers, it will
still be cleared, so nothing will go wrong.

2) For disabled -> enabled case
Current btrfs_qgroup_release_data() will use extent_io tree to ensure
we won't underflow reservation. And for delayed_ref we use
head->qgroup_reserved to record the reserved space, so in that case
head->qgroup_reserved should be 0 and we won't underflow.

CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3d0174f7 27-Sep-2018 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group

For qgroup_trace_extent_swap(), if we find one leaf that needs to be
traced, we will also iterate all file extents and trace them.

This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, balance code itself has ensured that
both subtree of file tree and reloc tree contain the same contents.

That's to say, if we're relocating metadata block groups, all file
extents in reloc and file tree should match, thus no need to trace them.
This should reduce the total number of dirty extents processed in metadata
block group balance.

[[Benchmark]] (with all previous enhancement)
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.

Mkfs parameter:
--nodesize 4K (To bump up tree size)

Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)

Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.

balance parameter:
-m

So the content should be pretty similar to a real world root fs layout.

| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22851 | -0.3%
qgroup dirty extents | 227757 | 140886 | -38.1%
time (sys) | 65.253s | 37.464s | -42.6%
time (real) | 74.032s | 44.722s | -39.6%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5f527822 27-Sep-2018 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents

Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.

E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)

File tree (src) Reloc tree (dst)
OO (a) NN (a)
/ \ / \
(b) OO OO (c) (b) NN NN (c)
/ \ / \ / \ / \
OO OO OO OO (d) OO OO OO NN (d)

For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.

It's doable for small file tree or new tree blocks are all located at
lower level.

But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.

This patch will change how we handle such balance with quota enabled
case.

Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.

In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification). And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)

For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.

This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:

1) Read out real root eb
And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
To trace all new tree blocks in reloc tree and their counter
parts in the file tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a9377422 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 280f8bd2 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8696d760 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# deb40627 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree

The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8d38d7eb 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a95f3aaf 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent

It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f0042d5e 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3efbee1d 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 49a05ecd 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 39616c27 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f8a6ce6 18-Jul-2018 Lu Fengqi <lufq.fnst@cn.fujitsu.com>

btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 340f1aa2 05-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable

Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:

- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed

This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.

This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.

Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9888c340 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- headers

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>


# 64cfaef6 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS

For meta_prealloc reservation users, after btrfs_join_transaction()
caller will modify tree so part (or even all) meta_prealloc reservation
should be converted to meta_pertrans until transaction commit time.

This patch introduces a new function,
btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC
reservation user.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 733e03a0 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans

Btrfs uses 2 different methods to reseve metadata qgroup space.

1) Reserve at btrfs_start_transaction() time
This is quite straightforward, caller will use the trans handler
allocated to modify b-trees.

In this case, reserved metadata should be kept until qgroup numbers
are updated.

2) Reserve by using block_rsv first, and later btrfs_join_transaction()
This is more complicated, caller will reserve space using block_rsv
first, and then later call btrfs_join_transaction() to get a trans
handle.

In this case, before we modify trees, the reserved space can be
modified on demand, and after btrfs_join_transaction(), such reserved
space should also be kept until qgroup numbers are updated.

Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:

META_PERTRANS:
For above case 1)

META_PREALLOC:
For reservations that happened before btrfs_join_transaction() of
case 2)

NOTE: This patch will only convert existing qgroup meta reservation
callers according to its situation, not ensuring all callers are at
correct timing.
Such fix will be added in later patches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c40507f 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Cleanup the remaining old reservation counters

So qgroup is switched to new separate types reservation system.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d4e5c920 12-Dec-2017 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Skeleton to support separate qgroup reservation type

Instead of single qgroup->reserved, use a new structure btrfs_qgroup_rsv
to store different types of reservation.

This patch only updates the header needed to compile.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 460fb20a 15-Mar-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Drop fs_info parameter from btrfs_qgroup_account_extents

It's provided by the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc42bda2 27-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges

[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

Task A | Task B
----------------------------------------------------------------------
Buffered_write [0, 2K) |
|- check_data_free_space() |
| |- qgroup_reserve_data() |
| Range aligned to page |
| range [0, 4K) <<< |
| 4K bytes reserved <<< |
|- copy pages to page cache |
| Buffered_write [2K, 4K)
| |- check_data_free_space()
| | |- qgroup_reserved_data()
| | Range alinged to page
| | range [0, 4K)
| | Already reserved by A <<<
| | 0 bytes reserved <<<
| |- delalloc_reserve_metadata()
| | And it *FAILED* (Maybe EQUOTA)
| |- free_reserved_data_space()
|- qgroup_free_data()
Range aligned to page range
[0, 4K)
Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 364ecf36 27-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d1b8b94a 27-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

Quite a lot of qgroup corruption happens due to wrong time of calling
btrfs_qgroup_prepare_account_extents().

Since the safest time is to call it just before
btrfs_qgroup_account_extents(), there is no need to separate these 2
functions.

Merging them will make code cleaner and less bug prone.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog and comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# d51ea5dd 13-Mar-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint

Newly introduced qgroup reserved space trace points are normally nested
into several common qgroup operations.

While some other trace points are not well placed to co-operate with
them, causing confusing output.

This patch re-arrange trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
before reserved space ones.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3159fe7b 13-Mar-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add trace point for qgroup reserved space

Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve

These trace points are handy to trace qgroup reserve space related
problems.

Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup
structure to trace points, so that structure needs to be exported.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f486135e 15-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: remove unused qgroup members from btrfs_trans_handle

The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562dec83b3), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb235dc0 14-Feb-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Move half of the qgroup accounting time out of commit trans

Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents().
Which both call btrfs_find_all_roots() to get old_roots and new_roots
ulist.

What makes things worse is, we're calling that expensive
btrfs_find_all_roots() at transaction committing time with
TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.

Such behavior is necessary for @new_roots search as current
btrfs_find_all_roots() can't do it correctly so we do call it just
before switch commit roots.

However for @old_roots search, it's not necessary as such search is
based on commit_root, so it will always be correct and we can move it
out of transaction committing.

This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can half the time qgroup time
consumption at commit_transaction().

But please note that, this won't speedup qgroup overall, the total time
consumption is still the same, just reduce the performance stall.

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 003d7c59 25-Jan-2017 Jeff Mahoney <jeffm@suse.com>

btrfs: allow unlink to exceed subvolume quota

Once a qgroup limit is exceeded, it's impossible to restore normal
operation to the subvolume without modifying the limit or removing
the subvolume. This is a surprising situation for many users used
to the typical workflow with quotas on other file systems where it's
possible to remove files until the used space is back under the limit.

When we go to unlink a file and start the transaction, we'll hit
the qgroup limit while trying to reserve space for the items we'll
modify while removing the file. We discussed last month how best
to handle this situation and agreed that there is no perfect solution.
The best principle-of-least-surprise solution is to handle it similarly
to how we already handle ENOSPC when unlinking, which is to allow
the operation to succeed with the expectation that it will ultimately
release space under most circumstances.

This patch modifies the transaction start path to select whether to
honor the qgroups limits. btrfs_start_transaction_fallback_global_rsv
is the only caller that skips enforcement. The reservation and tracking
still happens normally -- it just skips the enforcement step.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33d1f05c 17-Oct-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c

Move account_shared_subtree() to qgroup.c and rename it to
btrfs_qgroup_trace_subtree().

Do the same thing for account_leaf_items() and rename it to
btrfs_qgroup_trace_leaf_items().

Since all these functions are only for qgroup, move them to qgroup.c and
export them is more appropriate.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 50b3e040 17-Oct-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps

Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming schema.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1d2beaa9 17-Oct-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add comments explaining how btrfs qgroup works

Add explaination how btrfs qgroups work.

Qgroup is split into 3 main phrases:
1) Reserve
To ensure qgroup doesn't exceed its limit

2) Trace
To info qgroup to trace which extent

3) Account
Calculate qgroup number change for each traced extent.

This should save quite some time for new developers.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb93b52c 14-Aug-2016 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()

Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
Almost the same with original code.
For delayed_ref usage, which has delayed refs locked.

Change the return value type to int, since caller never needs the
pointer, but only needs to know if they need to free the allocated
memory.

2. btrfs_qgroup_insert_dirty_extent()
The more encapsulated version.

Will do the delayed_refs lock, memory allocation, quota enabled check
and other things.

The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.

Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d06f23d6 08-Aug-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: waiting on qgroup rescan should not always be interruptible

We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl. If the
user sends a signal while we're waiting, we continue happily along. This
is expected behavior for the rescan wait ioctl. It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# bc074524 09-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: prefix fsid to all trace events

When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event. This patch adds a
macro to prefix standard information to the head of a trace event.

The extent_state alloc/free events are all that's left without an
fs_info available.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 56fa9d07 12-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Check if qgroup reserved space leaked

Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 81fb6f77 28-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add new trace point for qgroup data reserve

Now each qgroup reserve for data will has its ftrace event for better
debugging.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7cf5b976 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Cleanup old inaccurate facilities

Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.

Now, the whole btrfs qgroup is swithed to use the new reserve facilities.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 55eeaf05 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce new functions to reserve/free metadata

Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free
metadata reserved space.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 297d750b 08-Sep-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: delayed_ref: release and free qgroup reserved at proper timing

Qgroup reserved space needs to be released from inode dirty map and get
freed at different timing:

1) Release when the metadata is written into tree
After corresponding metadata is written into tree, any newer write will
be COWed(don't include NOCOW case yet).
So we must release its range from inode dirty range map, or we will
forget to reserve needed range, causing accounting exceeding the limit.

2) Free reserved bytes when delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are all done at
commit_transaction() time, we are safe to free reserved space in
run_delayed_ref time().

With these timing to release/free reserved space, we should be able to
resolve the long existing qgroup reserve space leak problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f695fdce 12-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce functions to release/free qgroup reserve data
space

Introduce functions btrfs_qgroup_release/free_data() to release/free
reserved data range.

Release means, just remove the data range from io_tree, but doesn't
free the reserved space.
This is for normal buffered write case, when data is written into disc
and its metadata is added into tree, its reserved space should still be
kept until commit_trans().
So in that case, we only release dirty range, but keep the reserved
space recorded some other place until commit_tran().

Free means not only remove data range, but also free reserved space.
This is used for case for cleanup and invalidate page.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 52472553 12-Oct-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function

Introduce a new function, btrfs_qgroup_reserve_data(), which will use
io_tree to accurate qgroup reserve, to avoid reserved space leaking.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e69bcee3 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.

Goodbye, the old mechanisim.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 442244c9 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.

Since the self test transaction don't have delayed_ref_roots, so use
find_all_roots() and export btrfs_qgroup_account_extent() to simulate it

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 550d7a2e 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add new qgroup calculation function
btrfs_qgroup_account_extents().

The new btrfs_qgroup_account_extents() function should be called in
btrfs_commit_transaction() and it will update all the qgroup according
to delayed_ref_root->dirty_extent_root.

The new function can handle both normal operation during
commit_transaction() or in rescan in a unified method with clearer
logic.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 3b7d00f9 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Add new function to record old_roots.

Add function btrfs_qgroup_prepare_account_extents() to get old_roots
which are needed for qgroup.

We do it in commit_transaction() and before switch_roots(), and only
search commit_root, so it gives a quite accurate view for previous
transaction.

With old_roots from previous transaction, we can use it to do accurate
account with current transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 3368d001 16-Apr-2015 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: qgroup: Record possible quota-related extent for qgroup.

Add hook in add_delayed_ref_head() to record quota-related extent record
into delayed_ref_root->dirty_extent_record rb-tree for later qgroup
accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# e2d1f923 06-Feb-2015 Dongsheng Yang <yangds.fnst@cn.fujitsu.com>

btrfs: qgroup: do a reservation in a higher level.

There are two problems in qgroup:

a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.

b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.

The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.

In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 31193213 12-Dec-2014 Dongsheng Yang <yangds.fnst@cn.fujitsu.com>

Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.

Currently, for pre_alloc or delay_alloc, the bytes will be accounted
in space_info by the three guys.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
But on the other hand, in qgroup, there are only two counters to account the
bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved is not accounted
in qgroup. If so, there is a window we can exceed the quota limit when
bytes is in space_info->reserved.

Example:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 20987904 20987904 0 10485760 --- ---

qg->excl is 20987904 larger than max_excl 10485760.

This patch introduce a new counter named may_use to qgroup, then
there are three counters in qgroup to account bytes in space_info
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use --- qgroup->reserved --- qgroup->excl

With this patch applied:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 9453568 9453568 0 10485760 --- ---

Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 4087cf24 18-Jan-2015 Dongsheng Yang <yangds.fnst@cn.fujitsu.com>

Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1152651a 17-Jul-2014 Mark Fasheh <mfasheh@suse.de>

btrfs: qgroup: account shared subtrees during snapshot delete

During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# fcebe456 13-May-2014 Josef Bacik <jbacik@fb.com>

Btrfs: rework qgroup accounting

Currently qgroups account for space by intercepting delayed ref updates to fs
trees. It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly. The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together. Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc. This patch
accomplishes this by only adding qgroup operations for real ref changes. We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time. This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq: we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence. This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point. This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>