Searched +hist:0 +hist:ed586d0 (Results 1 - 1 of 1) sorted by relevance

/linux-master/fs/btrfs/
H A Dtree-log.cdiff e383e158 Thu Jan 25 17:59:23 MST 2024 Filipe Manana <fdmanana@suse.com> btrfs: preallocate temporary extent buffer for inode logging when needed

When logging an inode and we require to copy items from subvolume leaves
to the log tree, we clone each subvolume leaf and than use that clone to
copy items to the log tree. This is required to avoid possible deadlocks
as stated in commit 796787c978ef ("btrfs: do not modify log tree while
holding a leaf from fs tree locked").

The cloning requires allocating an extent buffer (struct extent_buffer)
and then allocating pages (folios) to attach to the extent buffer. This
may be slow in case we are under memory pressure, and since we are doing
the cloning while holding a read lock on a subvolume leaf, it means we
can be blocking other operations on that leaf for significant periods of
time, which can increase latency on operations like creating other files,
renaming files, etc. Similarly because we're under a log transaction, we
may also cause extra delay on other tasks doing an fsync, because syncing
the log requires waiting for tasks that joined a log transaction to exit
the transaction.

So to improve this, for any inode logging operation that needs to copy
items from a subvolume leaf ("full sync" or "copy everything" bit set
in the inode), preallocate a dummy extent buffer before locking any
extent buffer from the subvolume tree, and even before joining a log
transaction, add it to the log context and then use it when we need to
copy items from a subvolume leaf to the log tree. This avoids making
other operations get extra latency when waiting to lock a subvolume
leaf that is used during inode logging and we are under heavy memory
pressure.

The following test script with bonnie++ was used to test this:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdh
MNT=/mnt/sdh
MOUNT_OPTIONS="-o ssd"

MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'`
NR_DIRECTORIES=20
NR_FILES=20480
DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576))
DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES))
NR_FILES=$((NR_FILES / 1024))

echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

bonnie++ -u root -d $MNT \
-n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
-r 0 -s $DATASET_SIZE -b

umount $MNT

The results of this test on a 8G VM running a non-debug kernel (Debian's
default kernel config), were the following.

Before this change:

Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 7501M 376k 99 1.4g 96 117m 14 1510k 99 2.5g 95 +++++ +++
Latency 35068us 24976us 2944ms 30725us 71770us 26152us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:384100:384100/20 20480 32 20480 58 20480 48 20480 39 20480 56 20480 61
Latency 411ms 11914us 119ms 617ms 10296us 110ms

After this change:

Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 7501M 375k 99 1.4g 97 117m 14 1546k 99 2.3g 98 +++++ +++
Latency 35975us 20945us 2144ms 10297us 2217us 6004us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:384100:384100/20 20480 35 20480 58 20480 48 20480 40 20480 57 20480 59
Latency 320ms 11237us 77779us 518ms 6470us 86389us

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 0f8ac74d Fri Sep 22 04:39:08 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use extent_io_tree_release() to empty dirty log pages

When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.

So use extent_io_tree_release() instead of clear_extent_bits().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 0f8ac74d Fri Sep 22 04:39:08 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use extent_io_tree_release() to empty dirty log pages

When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.

So use extent_io_tree_release() instead of clear_extent_bits().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 50564b65 Tue Sep 12 06:04:29 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: abort transaction on generation mismatch when marking eb as dirty

When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff b4c639f6 Mon Sep 04 22:15:24 MDT 2023 Josef Bacik <josef@toxicpanda.com> btrfs: initialize start_slot in btrfs_log_prealloc_extents

Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this

fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~

The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff b4c639f6 Mon Sep 04 22:15:24 MDT 2023 Josef Bacik <josef@toxicpanda.com> btrfs: initialize start_slot in btrfs_log_prealloc_extents

Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this

fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~

The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff b4c639f6 Mon Sep 04 22:15:24 MDT 2023 Josef Bacik <josef@toxicpanda.com> btrfs: initialize start_slot in btrfs_log_prealloc_extents

Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this

fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~

The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff d67ba263 Tue May 16 17:02:13 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use inode_logged() at btrfs_record_unlink_dir()

At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0

num_init_files=10000
num_new_files=10000

mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done

echo -n > $MNT/testdir/foo

sync

# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done

# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo

# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!

echo 2 > /proc/sys/vm/drop_caches

kill $pid
wait $pid

# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo

# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))

echo "file fsync took: $dur milliseconds"

umount $MNT

Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff d67ba263 Tue May 16 17:02:13 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use inode_logged() at btrfs_record_unlink_dir()

At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0

num_init_files=10000
num_new_files=10000

mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done

echo -n > $MNT/testdir/foo

sync

# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done

# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo

# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!

echo 2 > /proc/sys/vm/drop_caches

kill $pid
wait $pid

# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo

# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))

echo "file fsync took: $dur milliseconds"

umount $MNT

Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff d67ba263 Tue May 16 17:02:13 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use inode_logged() at btrfs_record_unlink_dir()

At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0

num_init_files=10000
num_new_files=10000

mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done

echo -n > $MNT/testdir/foo

sync

# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done

# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo

# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!

echo 2 > /proc/sys/vm/drop_caches

kill $pid
wait $pid

# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo

# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))

echo "file fsync took: $dur milliseconds"

umount $MNT

Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff d67ba263 Tue May 16 17:02:13 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use inode_logged() at btrfs_record_unlink_dir()

At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0

num_init_files=10000
num_new_files=10000

mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done

echo -n > $MNT/testdir/foo

sync

# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done

# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo

# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!

echo 2 > /proc/sys/vm/drop_caches

kill $pid
wait $pid

# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo

# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))

echo "file fsync took: $dur milliseconds"

umount $MNT

Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff d67ba263 Tue May 16 17:02:13 MDT 2023 Filipe Manana <fdmanana@suse.com> btrfs: use inode_logged() at btrfs_record_unlink_dir()

At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.

However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.

Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:

$ cat test.sh
#!/bin/bash

DEV=/dev/nullb0
MNT=/mnt/nullb0

num_init_files=10000
num_new_files=10000

mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT

mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done

echo -n > $MNT/testdir/foo

sync

# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done

# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo

# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!

echo 2 > /proc/sys/vm/drop_caches

kill $pid
wait $pid

# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo

# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))

echo "file fsync took: $dur milliseconds"

umount $MNT

Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Completed in 457 milliseconds