History log of /linux-master/fs/btrfs/volumes.h
Revision Date Author Comments
# 1cdeac6d 22-Feb-2024 David Sterba <dsterba@suse.com>

btrfs: pass btrfs_device to btrfs_scratch_superblocks()

Replace the two parameters bdev and name by one that can be used to get
them both.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 2761ece8 05-Feb-2024 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: introduce offload_csum_mode to tweak checksum offloading behavior

We disable offloading checksum to workqueues and do it synchronously when
the checksum algorithm is fast. However, as reported in the link below,
RAID0 with multiple devices may suffer from the sync checksum, because
"fast checksum" is still not fast enough to catch up with RAID0 writing.

We don't have an effective way to determine whether to offload or not,
for now add a sysfs knob so this can be debugged. This is intentionally
under CONFIG_BTRFS_DEBUG so ti's not exposed to users as it may be
removed in the future agin.

Introduce fs_devices->offload_csum_mode, so that a btrfs developer can
change the behavior by writing to /sys/fs/btrfs/<uuid>/offload_csum. The
default is "auto" which is the same as the previous behavior. Or, you
can set "on" or "off" (or "y" or "n" whatever kstrtobool() accepts) to
always/never offload checksum.

More benchmark need to be collected with this knob to implement a proper
criteria to enable/disable checksum offloading.

Link: https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5693a128 26-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: add forward declarations and headers, part 3

Do a cleanup in the rest of the headers:

- add forward declarations for types referenced by pointers
- add includes when types need them

This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9ae061cf 23-Jan-2024 Christian Brauner <brauner@kernel.org>

btrfs: port device access to file

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-19-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# eefaf0a1 05-Dec-2023 David Sterba <dsterba@suse.com>

btrfs: fix typos found by codespell

Signed-off-by: David Sterba <dsterba@suse.com>


# 7dc66abb 21-Nov-2023 Filipe Manana <fdmanana@suse.com>

btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.

Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.

We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 86ec15d0 27-Sep-2023 Jan Kara <jack@suse.cz>

btrfs: Convert to bdev_open_by_path()

Convert btrfs to use bdev_open_by_path() and pass the handle around. We
also drop the holder from struct btrfs_device as it is now not needed
anymore.

CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 000331bb 04-Oct-2023 Anand Jain <anand.jain@oracle.com>

btrfs: update comment for temp-fsid, fsid, and metadata_uuid

Update the comment to explain the relationship between temp_fsid, fsid,
and metadata_uuid.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a5b8a5f9 27-Sep-2023 Anand Jain <anand.jain@oracle.com>

btrfs: support cloned-device mount capability

Guilherme's previous work [1] aimed at the mounting of cloned devices
using a superblock flag SINGLE_DEV during mkfs.
[1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/

Building upon this work, here is in memory only approach. As it mounts
we determine if the same fsid is already mounted if then we generate a
random temp fsid which shall be used the mount, in memory only not
written to the disk. We distinguish devices by devt.

Example:
$ fallocate -l 300m ./disk1.img
$ mkfs.btrfs -f ./disk1.img
$ cp ./disk1.img ./disk2.img
$ cp ./disk1.img ./disk3.img
$ mount -o loop ./disk1.img /btrfs
$ mount -o ./disk2.img /btrfs1
$ mount -o ./disk3.img /btrfs2

$ btrfs fi show -m
Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop0

Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop1

Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
Total devices 1 FS bytes used 144.00KiB
devid 1 size 300.00MiB used 88.00MiB path /dev/loop2

Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5966930d 20-Sep-2023 Anand Jain <anand.jain@oracle.com>

btrfs: remove incomplete metadata_uuid conversion fixup logic

Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has
stopped the assembly of devices with the CHANGING_FSID_V2 flag in the
kernel. Such devices can be scanned but will not be registered and can't
be mounted without a manual fix by btrfstune. Remove the related logic
and now unused code.

The original motivation was to allow an interrupted partial conversion
fix itself on next mount, in case the system has to be rebooted. This is
a convenience but brings a lot of complexity the device scanning and
handling the partial states. It's hard to estimate if this was ever
needed in practice, expecting the typical use case like a manual
conversion of an unmounted filesystem where the user can verify the
success and rerun it eventually.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add historical context ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 9acaa641 14-Sep-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: scrub: implement raid stripe tree support

A filesystem that uses the raid stripe tree for logical to physical
address translation can't use the regular scrub path, that reads all
stripes and then checks if a sector is unused afterwards.

When using the raid stripe tree, this will result in lookup errors, as
the stripe tree doesn't know the requested logical addresses.

In case we're scrubbing a filesystem which uses the RAID stripe tree for
multi-device logical to physical address translation, perform an extra
block mapping step to get the real on-disk stripe length from the stripe
tree when scrubbing the sectors.

This prevents a double completion of the btrfs_bio caused by splitting the
underlying bio and ultimately a use-after-free.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 02c372e1 14-Sep-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: add support for inserting raid stripe extents

Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed,
this is done to

a) support zone-append and
b) rule out the possibility of a RAID-write-hole.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc27d6f0 08-Sep-2023 Anand Jain <anand.jain@oracle.com>

btrfs: scan but don't register device on single device filesystem

After the commit 5f58d783fd78 ("btrfs: free device in btrfs_close_devices
for a single device filesystem") we unregister the device from the kernel
memory upon unmounting for a single device.

So, device registration that was performed before mounting if any is no
longer in the kernel memory.

However, in fact, note that device registration is unnecessary for a
single-device btrfs filesystem unless it's a seed device.

So for commands like 'btrfs device scan' or 'btrfs device ready' with a
non-seed single-device btrfs filesystem, they can return success just
after superblock verification and without the actual device scan. When
'device scan --forget' is called on such device no error is returned.

The seed device must remain in the kernel memory to allow the sprout
device to mount without the need to specify the seed device explicitly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9fb2acc2 17-Sep-2023 Qu Wenruo <wqu@suse.com>

btrfs: remove the need_raid_map parameter from btrfs_map_block()

The parameter @need_raid_map is mostly a legacy from the old days where
we don't yet have a solid definition on the @mirror_num, and only
check-integrity was using that parameter, while all other call sites
just pass 1 for that parameter.

Now since we have removed check-integrity functionality, we can also
remove the @need_raid_map parameter.

This change will also remove the ability to read P/Q stripe directly
when passing 0 as @need_raid_map.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb6eb475 06-Sep-2023 Anand Jain <anand.jain@oracle.com>

btrfs: comment about fsid and metadata_uuid relationship

Add a comment explaining the relationship between fsid and metadata_uuid
in the on-disk superblock and the in-memory struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4844c366 31-Jul-2023 Anand Jain <anand.jain@oracle.com>

btrfs: add a helper to read the superblock metadata_uuid

In some cases, we need to read the FSID from the superblock when the
metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
add a helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 883647f4 26-Jul-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make find_free_dev_extent() static

The function find_free_dev_extent() is only used within volumes.c, so make
it static and remove its prototype from volumes.h.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cb091225 22-Jun-2023 Qu Wenruo <wqu@suse.com>

btrfs: fix remaining u32 overflows when left shifting stripe_nr

There was regression caused by a97699d1d610 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by
a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].

The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().

Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types. This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.

[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/

Fixes: a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 723b8bb1 30-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: open code btrfs_map_sblock

btrfs_map_sblock just hard codes three arguments and calls
btrfs_map_sblock. Remove it as it doesn't provide any real value, but
makes following the btrfs_map_block call chains harder.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cd4efd21 30-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: rename __btrfs_map_block to btrfs_map_block

Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d69d7ffc 30-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove unused btrfs_map_block

There are no users of btrfs_map_block left, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3965a4c7 30-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove unused BTRFS_MAP_DISCARD

BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
btrfs_op() only only checked in two ASSERTS.

Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06f188
("btrfs: split discard handling out of btrfs_map_block").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d85512d5 24-May-2023 Anand Jain <anand.jain@oracle.com>

btrfs: add and fix comments in btrfs_fs_devices

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f62c302e 24-May-2023 Anand Jain <anand.jain@oracle.com>

btrfs: add comment about metadata_uuid in btrfs_fs_devices

Add comment about metadata_uuid in btrfs_fs_devices.
No functional change.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4693893b 24-May-2023 Anand Jain <anand.jain@oracle.com>

btrfs: reduce struct btrfs_fs_devices size by moving fsid_change

Pack bool fsid_change and bool seeding with other bool declarations in the
struct btrfs_fs_devices, approximately 6 bytes is saved, depending on
the config.

before: 512 bytes
after: 496 bytes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f2db4d5c 26-Apr-2023 Filipe Manana <fdmanana@suse.com>

btrfs: make btrfs_free_device() static

The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 05bdb996 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: replace fmode_t with a block-specific type for block open flags

The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2736e8ee 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: use the holder as indication for exclusive opens

The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2ef78928 08-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't pass a holder for non-exclusive blkdev_get_by_path

Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead and remove the holder argument from the
call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls.

Exclusive mode for device scanning is not used since commit 50d281fc434c
("btrfs: scan device in non-exclusive mode")".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4886ff7b 19-Mar-2023 Qu Wenruo <wqu@suse.com>

btrfs: introduce a new helper to submit write bio for repair

Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
Even for read-repair on RAID56, we only update the corrupted data
stripe itself, not triggering the full RMW path.

- Requires a valid @mirror_num
For RAID56 case, only @mirror_num == 1 is valid.
For non-RAID56 cases, we need @mirror_num to locate our stripe.

- No data csum generation needed

These two call sites still have some differences though:

- Read-repair goes plain bio
It doesn't need a full btrfs_bio, and goes submit_bio_wait().

- New scrub repair would go btrfs_bio
To simplify both read and write path.

So here this patch would:

- Introduce a common helper, btrfs_map_repair_block()
Due to the single device nature, we can use an on-stack
btrfs_io_stripe to pass device and its physical bytenr.

- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
code
This is for the incoming scrub code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 18d758a2 16-Feb-2023 Qu Wenruo <wqu@suse.com>

btrfs: replace btrfs_io_context::raid_map with a fixed u64 value

In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.

But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.

So why we waste the space and time (for sorting) for raid_map?

This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:

- Replace btrfs_io_context::raid_map with full_stripe_start

- Replace call sites using raid_map[0] to use full_stripe_start

- Replace call sites using raid_map[i] to compare with nr_data_stripes.

The benefits are:

- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It'll always save at least one u64, and the benefit grows larger with
num_stripes.

- No more weird alloc_btrfs_io_context() behavior
As there is only one fixed size + one variable length array.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1faf3885 06-Feb-2023 Qu Wenruo <wqu@suse.com>

btrfs: use an efficient way to represent source of duplicated stripes

For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.

For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).

But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.

Currently we use a complex way using tgtdev_map[] array, e.g:

num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.

But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.

Thus we can change it to a fixed array to represent the mapping:

replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]

By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.

Thus the patch involves the following changes:

- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.

@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.

- Add extra ASSERT()s for RAID56 code

- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.

- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.

- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4ced85f8 06-Feb-2023 Qu Wenruo <wqu@suse.com>

btrfs: reduce type width of btrfs_io_contexts

That structure is our ultimate object for all __btrfs_map_block()
related functions. We have some hard to understand members, like
tgtdev_map, but without any comments.

This patch will improve the situation:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
tgtdev_map[]
Especially for the last two members, add a dedicated (thus very long)
comments for them, with example to explain it.

- Shrink those int members to u16.
In fact our on-disk format is only using u16 for num_stripes, thus
no need to use int at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a97699d1 16-Feb-2023 Qu Wenruo <wqu@suse.com>

btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN

Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.

There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.

Furthermore we can use the fix stripe length to do the following
optimization:

- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
Now we only need to do a right shift.

And the value of BTRFS_STRIPE_LEN itself is already too large for bit
shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
a compiler warning would be triggered.

Thus this bit shift optimization would be safe.

- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f8a02dc6 20-Jan-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove struct btrfs_io_geometry

Now that btrfs_get_io_geometry has a single caller, we can massage it
into a form that is more suitable for that caller and remove the
marshalling into and out of struct btrfs_io_geometry.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 103c1972 15-Nov-2022 Christoph Hellwig <hch@lst.de>

btrfs: split the bio submission path into a separate file

The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c. Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.

Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 27137fac 15-Nov-2022 Christoph Hellwig <hch@lst.de>

btrfs: move struct btrfs_tree_parent_check out of disk-io.h

Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
an various .c files don't have to include disk-io.h just for it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use tree-checker.h for the structure ]
Signed-off-by: David Sterba <dsterba@suse.com>


# cb3e217b 12-Nov-2022 Qu Wenruo <wqu@suse.com>

btrfs: use btrfs_dev_name() helper to handle missing devices better

[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:

BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)

Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".

[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).

Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.

[FIX]
Fix the output by:

- Move the btrfs_dev_name() helper to volumes.h

- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.

- Constify return value and parameter

Now the output looks pretty sane:

BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 947a6299 13-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: move tree block parentness check into validate_extent_buffer()

[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().

On the other hand, all the data verification is done at endio context.

[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.

With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.

And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.

This brings the following benefits:

- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.

- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bb21e302 07-Nov-2022 Anand Jain <anand.jain@oracle.com>

btrfs: move device->name RCU allocation and assign to btrfs_alloc_device()

There is a repeating code section in the parent function after calling
btrfs_alloc_device(), as below:

name = rcu_string_strdup(path, GFP_...);
if (!name) {
btrfs_free_device(device);
return ERR_PTR(-ENOMEM);
}
rcu_assign_pointer(device->name, name);

Except in add_missing_dev() for obvious reasons.

This patch consolidates that repeating code into the btrfs_alloc_device()
itself so that the parent function doesn't have to duplicate code.
This consolidation also helps to review issues regarding RCU lock
violation with device->name.

Parent function device_list_add() and add_missing_dev() use GFP_NOFS for
the allocation, whereas the rest of the parent functions use GFP_KERNEL,
so bring the NOFS allocation context using memalloc_nofs_save() in the
function device_list_add() and add_missing_dev() is already doing it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b31bed17 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_chunk_item_size out of ctree.h

This is used by the volumes code and the tree checker code. We want to
maintain inline however, so simply move it to volumes.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 63a7cb13 26-Jul-2022 David Sterba <dsterba@suse.com>

btrfs: auto enable discard=async when possible

There's a request to automatically enable async discard for capable
devices. We can do that, the async mode is designed to wait for larger
freed extents and is not intrusive, with limits to iops, kbps or latency.

The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .

The automatic selection is done if there's at least one discard capable
device in the filesystem (not capable devices are skipped). Mounting
with any other discard option will honor that option, notably mounting
with nodiscard will keep it disabled.

Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>


# a8d1b164 04-Nov-2022 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: initialize device's zone info for seeding

When performing seeding on a zoned filesystem it is necessary to
initialize each zoned device's btrfs_zoned_device_info structure,
otherwise mounting the filesystem will cause a NULL pointer dereference.

This was uncovered by fstests' testcase btrfs/163.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ae0e5df4 10-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: reorder btrfs_bio for better packing

After changes in commit 917f32a23501 ("btrfs: give struct btrfs_bio a
real end_io handler") the layout of btrfs_bio can be improved. There
are two holes and the structure size is 264 bytes on release build. By
reordering the iterator we can get rid of the holes and the size is 256
bytes which fits to slabs much better.

Final layout:

struct btrfs_bio {
unsigned int mirror_num; /* 0 4 */
struct bvec_iter iter; /* 4 20 */
u64 file_offset; /* 24 8 */
struct btrfs_device * device; /* 32 8 */
u8 * csum; /* 40 8 */
u8 csum_inline[64]; /* 48 64 */
/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
btrfs_bio_end_io_t end_io; /* 112 8 */
void * private; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct work_struct end_io_work; /* 128 32 */
struct bio bio; /* 160 96 */

/* size: 256, cachelines: 4, members: 10 */
};

Fixes: 917f32a23501 ("btrfs: give struct btrfs_bio a real end_io handler")
Signed-off-by: David Sterba <dsterba@suse.com>


# 2103da3b 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_swapfile_pin into volumes.h

This isn't a great spot for this, but one of the swapfile helper
functions is in volumes.c, so move the struct to volumes.h. In the
future when we have better separation of code there will be a more
natural spot for this.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c2e79e86 14-Sep-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_pinned_by_swapfile prototype into volumes.h

This is defined in volumes.c, move the prototype into volumes.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 917f32a2 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: give struct btrfs_bio a real end_io handler

Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.

To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.

Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# c3a62baf 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: use chained bios when cloning

The stripes_pending in the btrfs_io_context counts number of inflight
low-level bios for an upper btrfs_bio. For reads this is generally
one as reads are never cloned, while for writes we can trivially use
the bio remaining mechanisms that is used for chained bios.

To be able to make use of that mechanism, split out a separate trivial
end_io handler for the cloned bios that does a minimal amount of error
tracking and which then calls bio_endio on the original bio to transfer
control to that, with the remaining counter making sure it is completed
last. This then allows to merge btrfs_end_bioc into the original bio
bi_end_io handler.

To make this all work all error handling needs to happen through the
bi_end_io handler, which requires a small amount of reshuffling in
submit_stripe_bio so that the bio is cloned already by the time the
suitability of the device is checked.

This reduces the size of the btrfs_io_context and prepares splitting
the btrfs_bio at the stripe boundary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b42f5e3 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: pass the operation to btrfs_bio_alloc

Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d45cfb88 06-Aug-2022 Christoph Hellwig <hch@lst.de>

btrfs: move btrfs_bio allocation to volumes.c

volumes.c is the place that implements the storage layer using the
btrfs_bio structure, so move the bio_set and allocation helpers there
as well.

To make up for the new initialization boilerplate, merge the two
init/exit helpers in extent_io.c into a single one.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1a722d8f 16-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: do not return errors from btrfs_map_bio

Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches
what the block layer submission does and avoids any confusion on who
needs to handle errors.

As this requires touching all the callers, rename the function to
btrfs_submit_bio, which describes the functionality much better.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# ee5b46a3 21-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: increase direct io read size limit to 256 sectors

Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da406 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef. That commit changes the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work, instead they
use a csums allocation on a per-btrfs_dio_private basis (which have their
own performance problem for small I/O, but that will be addressed later).

There is no fundamental limit in btrfs itself to limit the I/O size
except for the size of the checksum array that scales linearly with
the number of sectors in an I/O. Pick a somewhat arbitrary limit of
256 limits, which matches what the buffered reads typically see as
the upper limit as the limit for direct I/O as well.

This significantly improves direct read performance. For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:

Baseline:

READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec

With this patch:

READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0b30f719 13-May-2022 Qu Wenruo <wqu@suse.com>

btrfs: use btrfs_raid_array to calculate number of parity stripes

Use the raid table instead of hard coded values and rename the helper as
it is exported. This could make later extension on RAID56 based
profiles easier.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bc88b486 13-May-2022 Qu Wenruo <wqu@suse.com>

btrfs: remove parameter dev_extent_len from scrub_stripe()

For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.

Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a4012f06 03-Jun-2022 Christoph Hellwig <hch@lst.de>

btrfs: split discard handling out of btrfs_map_block

Mapping block for discard doesn't really share any code with the regular
block mapping case. Split it out into an entirely separate helper
that just returns an array of btrfs_discard_stripe structures and the
number of stripes.

This removes the need for the length field in the btrfs_io_context
structure, so remove tht.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9ff7ddd3 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: do not allocate a btrfs_bio for low-level bios

The bios submitted from btrfs_map_bio don't really interact with the
rest of btrfs and the only btrfs_bio member actually used in the
low-level bios is the pointer to the btrfs_io_context used for endio
handler.

Use a union in struct btrfs_io_stripe that allows the endio handler to
find the btrfs_io_context and remove the spurious ->device assignment
so that a plain fs_bio_set bio can be used for the low-level bios
allocated inside btrfs_map_bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# d7b9416f 26-May-2022 Christoph Hellwig <hch@lst.de>

btrfs: remove btrfs_end_io_wq

All reads bio that go through btrfs_map_bio need to be completed in
user context. And read I/Os are the most common and timing critical
in almost any file system workloads.

Embed a work_struct into struct btrfs_bio and use it to complete all
read bios submitted through btrfs_map, using the REQ_META flag to decide
which workqueue they are placed on.

This removes the need for a separate 128 byte allocation (typically
rounded up to 192 bytes by slab) for all reads with a size increase
of 24 bytes for struct btrfs_bio. Future patches will reorganize
struct btrfs_bio to make use of this extra space for writes as well.

(All sizes are based a on typical 64-bit non-debug build)

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 261d812b 22-May-2022 Qu Wenruo <wqu@suse.com>

btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks

Add a helper that works similar to __bio_for_each_segment, but instead of
iterating over PAGE_SIZE chunks it iterates over each sector.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: split from a larger patch, and iterate over the offset instead of
the offset bits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add parameter comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 719fae89 20-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: use ilog2() to replace if () branches for btrfs_bg_flags_to_raid_index()

In function btrfs_bg_flags_to_raid_index(), we use quite some if () to
convert the BTRFS_BLOCK_GROUP_* bits to a index number.

But the truth is, there is really no such need for so many branches at
all.
Since all BTRFS_BLOCK_GROUP_* flags are just one single bit set inside
BTRFS_BLOCK_GROUP_PROFILES_MASK, we can easily use ilog2() to calculate
their values.

This calculation has an anchor point, the lowest PROFILE bit, which is
RAID0.

Even it's fixed on-disk format and should never change, here I added
extra compile time checks to make it super safe:

1. Make sure RAID0 is always the lowest bit in PROFILE_MASK
This is done by finding the first (least significant) bit set of
RAID0 and PROFILE_MASK & ~RAID0.

2. Make sure RAID0 bit set beyond the highest bit of TYPE_MASK

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f04fbcc6 20-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: move definition of btrfs_raid_types to volumes.h

It's only internally used as another way to represent btrfs profiles,
it's not exposed through any on-disk format, in fact this
btrfs_raid_types is diverted from the on-disk format values.

Furthermore, since it's internal structure, its definition can change in
the future.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cc353a8b 12-Apr-2022 Qu Wenruo <wqu@suse.com>

btrfs: reduce width for stripe_len from u64 to u32

Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough
for the usage.

Furthermore, even in the future we choose to enlarge stripe length to
larger values, I don't believe we would want stripe as large as 4G or
larger.

So this patch will reduce the width for all in-memory structures and
parameters, this involves:

- RAID56 related function argument lists
This allows us to do direct division related to stripe_len.
Although we will use bits shift to replace the division anyway.

- btrfs_io_geometry structure
This involves one change to simplify the calculation of both @stripe_nr
and @stripe_offset, using div64_u64_rem().
And add extra sanity check to make sure @stripe_offset is always small
enough for u32.

This saves 8 bytes for the structure.

- map_lookup structure
This convert @stripe_len to u32, which saves 8 bytes. (saved 4 bytes,
and removed a 4-bytes hole)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f9e69aa9 06-Apr-2022 Christoph Hellwig <hch@lst.de>

btrfs: simplify ->flush_bio handling

Use and embedded bios that is initialized when used instead of
bio_kmalloc plus bio_reset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220406061228.410163-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 00d82525 24-Mar-2022 Christoph Hellwig <hch@lst.de>

btrfs: fix direct I/O read repair for split bios

When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio. But this means the start value used
in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add
a file_offset field to struct btrfs_bio to pass along the correct offset.

Given that check_data_csum only uses start of an error message this
means problems with this miscalculation will only show up when I/O fails
or checksums mismatch.

The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical
member") but we need it due to the bio splitting.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4889bc05 11-Jan-2022 Anand Jain <anand.jain@oracle.com>

btrfs: add device major-minor info in the struct btrfs_device

Internally it is common to use the major-minor number to identify a
device and, at a few locations in btrfs, we use the major-minor number
to match the device.

So when we identify a new btrfs device through device add or device
replace or device-scan/ready save the device's major-minor (dev_t) in the
struct btrfs_device so that we don't have to call lookup_bdev() again.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 16cab91a 11-Jan-2022 Anand Jain <anand.jain@oracle.com>

btrfs: match stale devices by dev_t

After the commit "btrfs: harden identification of the stale device", we
don't have to match the device path anymore. Instead, we match the dev_t.
So pass in the dev_t instead of the device path, in the call chain
btrfs_forget_devices()->btrfs_free_stale_devices().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f26c9238 14-Dec-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove reada infrastructure

Currently there is only one user for btrfs metadata readahead, and
that's scrub.

But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).

Despite this, there are some extra problems related to metadata
readahead:

- Duplicated feature with btrfs_path::reada

- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.

- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.

This brings extra chance for bugs to sneak in, while brings
unnecessary complexity.

- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.

Thus here I purpose to remove the metadata readahead mechanism
completely.

[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.

For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.

The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.

Old New Diff
SSD 455.3 466.332 +2.42%
HDD 103.927 98.012 -5.69%

Comprehensive test methodology is in the cover letter of the patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 554aed7d 07-Dec-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: sink zone check into btrfs_repair_one_zone

Sink zone check into btrfs_repair_one_zone() so we don't need to do it
in all callers.

Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
a boolean function and return false in case it got called on a non-zoned
filesystem and true on a zoned filesystem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1a15eb72 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls

For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.

However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb

Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# faa775c4 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: add a btrfs_get_dev_args_from_path helper

We are going to want to populate our device lookup args outside of any
locks and then do the actual device lookup later, so add a helper to do
this work and make btrfs_find_device_by_devspec() use this helper for
now.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 562d7b15 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: handle device lookup with btrfs_dev_lookup_args

We have a lot of device lookup functions that all do something slightly
different. Clean this up by adding a struct to hold the different
lookup criteria, and then pass this around to btrfs_find_device() so it
can do the proper matching based on the lookup criteria.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# add9745a 05-Oct-2021 Anand Jain <anand.jain@oracle.com>

btrfs: add comments for device counts in struct btrfs_fs_devices

A bug was was checking a wrong device count before we delete the struct
btrfs_fs_devices in btrfs_rm_device(). To avoid future confusion and
easy reference add a comment about the various device counts that we have
in the struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f4f39fc5 08-Oct-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove btrfs_bio::logical member

The member btrfs_bio::logical is only initialized by two call sites:

- btrfs_repair_one_sector()
No corresponding site to utilize it.

- btrfs_submit_direct()
The corresponding site to utilize it is btrfs_check_read_dio_bio().

However for btrfs_check_read_dio_bio(), we can grab the file_offset from
btrfs_dio_private::file_offset directly.

Thus it turns out we don't really need that btrfs_bio::logical member at
all.

For btrfs_bio, the logical bytenr can be fetched from its
bio->bi_iter.bi_sector directly.

So let's just remove the member to save 8 bytes for structure btrfs_bio.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c3a3b19b 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename struct btrfs_io_bio to btrfs_bio

Previously we had "struct btrfs_bio", which records IO context for
mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
btrfs specific info for logical bytenr bio.

With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
"btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.

The struct btrfs_bio changes meaning by this commit. There was a
suggested name like btrfs_logical_bio but it's a bit long and we'd
prefer to use a shorter name.

This could be a concern for backports to older kernels where the
different meaning could possibly cause confusion or bugs. Comparing the
new and old structures, there's no overlap among the struct members so a
build would break in case of incorrect backport.

We haven't had many backports to bio code anyway so this is more of a
theoretical cause of bugs and a matter of precaution but we'll need to
keep the semantic change in mind.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4c664611 15-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: rename btrfs_bio to btrfs_io_context

The structure btrfs_bio is used by two different sites:

- bio->bi_private for mirror based profiles
For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
how many mirrors are still pending, and save the original endio
function of the bio.

- RAID56 code
In that case, RAID56 only utilize the stripes info, and no long uses
that to trace the pending mirrors.

So btrfs_bio is not always bind to a bio, and contains more info for IO
context, thus renaming it will make the naming less confusing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d24fa5c1 23-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: convert latest_bdev type to btrfs_device and rename

In preparation to fix a bug in btrfs_show_devname().

Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a09f23c3 23-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: rename and switch to bool btrfs_chunk_readonly

btrfs_chunk_readonly() checks if the given chunk is writeable. It
returns 1 for readonly, and 0 for writeable. So the return argument type
bool shall suffice instead of the current type int.

Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we
check if the bg is writeable, and helps to keep the logic at the parent
function simpler to understand.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f6f39f7a 18-Aug-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk

The user facing function used to allocate new chunks is
btrfs_chunk_alloc, unfortunately there is yet another similar sounding
function - btrfs_alloc_chunk. This creates confusion, especially since
the latter function can be considered "private" in the sense that it
implements the first stage of chunk creation and as such is called by
btrfs_chunk_alloc.

To avoid the awkwardness that comes with having similarly named but
distinctly different in their purpose function rename btrfs_alloc_chunk
to btrfs_create_chunk, given that the main purpose of this function is
to orchestrate the whole process of allocating a chunk - reserving space
into devices, deciding on characteristics of the stripe size and
creating the in-memory structures.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3fa421de 27-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: delay blkdev_put until after the device remove

When removing the device we call blkdev_put() on the device once we've
removed it, and because we have an EXCL open we need to take the
->open_mutex on the block device to clean it up. Unfortunately during
device remove we are holding the sb writers lock, which results in the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #407 Not tainted
------------------------------------------------------
losetup/11595 is trying to acquire lock:
ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_put+0x3a/0x220
btrfs_rm_device.cold+0x62/0xe5
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11595:
#0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc21255d4cb

So instead save the bdev and do the put once we've dropped the sb
writers lock in order to avoid the lockdep recursion.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 500a44c9 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: uninline btrfs_bg_flags_to_raid_index

The helper does a simple translation from block group flags to index to
the btrfs_raid_array table. There's no apparent reason to inline the
function, the translation happens usually once per function and is not
called in a loop.

Making it a proper function saves quite some binary code (x86_64,
release config):

text data bss dec hex filename
1164011 19253 14912 1198176 124860 pre/btrfs.ko
1161559 19253 14912 1195724 123ecc post/btrfs.ko

DELTA: -2451

Also add the const attribute as there are no side effects, this could
help compiler to optimize a few things without the function body.

Signed-off-by: David Sterba <dsterba@suse.com>


# 2eadb9e7 04-Jul-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_finish_chunk_alloc private to block-group.c

One of the final things that must be done to add a new chunk is
inserting its device extent items in the device tree. They describe
the portion of allocated device physical space during phase 1 of
chunk allocation. This is currently done in btrfs_finish_chunk_alloc
whose name isn't very informative. What's more, this function is only
used in block-group.c but is defined as public. There isn't anything
special about it that would warrant it being defined in volumes.c.

Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to
block-group.c, make the former static and rename both functions to
insert_dev_extents and insert_dev_extent respectively.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 79bd3712 29-Jun-2021 Filipe Manana <fdmanana@suse.com>

btrfs: rework chunk allocation to avoid exhaustion of the system chunk array

Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 43c0d1a5 13-Apr-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()

The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.

It got removed in 420343131970 ("btrfs: let callers of
btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map
utilized it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# eb3b5053 09-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: scrub: per-device bandwidth control

Add sysfs interface to limit io during scrub. We relied on the ionice
interface to do that, eg. the idle class let the system usable while
scrub was running. This has changed when mq-deadline got widespread and
did not implement the scheduling classes. That was a CFQ thing that got
deleted. We've got numerous complaints from users about degraded
performance.

Currently only BFQ supports that but it's not a common scheduler and we
can't ask everybody to switch to it.

Alternatively the cgroup io limiting can be used but that also a
non-trivial setup (v2 required, the controller must be enabled on the
system). This can still be used if desired.

Other ideas that have been explored: piggy-back on ionice (that is set
per-process and is accessible) and interpret the class and classdata as
bandwidth limits, but this does not have enough flexibility as there are
only 8 allowed and we'd have to map fixed limits to each value. Also
adjusting the value would need to lookup the process that currently runs
scrub on the given device, and the value is not sticky so would have to
be adjusted each time scrub runs.

Running out of options, sysfs does not look that bad:

- it's accessible from scripts, or udev rules
- the name is similar to what MD-RAID has
(/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
- the value is sticky at least for filesystem mount time
- adjusting the value has immediate effect
- sysfs is available in constrained environments (eg. system rescue)
- the limit also applies to device replace

Sysfs:

- raw value is in bytes
- values written to the file accept suffixes like K, M
- file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
- 0 means use default priority of IO

The scheduler is a simple deadline one and the accuracy is up to nearest
128K.

Signed-off-by: David Sterba <dsterba@suse.com>


# 18bb8bbf 19-Apr-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: automatically reclaim zones

When a file gets deleted on a zoned file system, the space freed is not
returned back into the block group's free space, but is migrated to
zone_unusable.

As this zone_unusable space is behind the current write pointer it is not
possible to use it for new allocations. In the current implementation a
zone is reset once all of the block group's space is accounted as zone
unusable.

This behaviour can lead to premature ENOSPC errors on a busy file system.

Instead of only reclaiming the zone once it is completely unusable,
kick off a reclaim job once the amount of unusable bytes exceeds a user
configurable threshold between 51% and 100%. It can be set per mounted
filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
by default.

Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list and then on a transaction commit, the reclaim
process is triggered but after we deleted unused block groups, which will
free space for the relocation process.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f7ef5287 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: relocate block group to repair IO failure in zoned filesystems

When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:

(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group

Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.

Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.

Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cfe94440 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing

Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.

Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1cd6121f 04-Feb-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: implement zoned chunk allocator

Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem.

- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size

init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.

dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.

dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.

With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.

Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 42034313 27-Jan-2021 Michal Rostecki <mrostecki@suse.com>

btrfs: let callers of btrfs_get_io_geometry pass the em

Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.

Change that to passing the extent mapping to the btrfs_get_io_geometry()
function as an argument.

This could improve performance in some cases. For very large
filesystems, i.e. several thousands of allocated chunks, not only this
avoids searching two times the rbtree, saving time, it may also help
reducing contention on the lock that protects the tree - thinking of
writeback starting for multiple inodes, other tasks allocating or
removing chunks, and anything else that requires access to the rbtree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Michal Rostecki <mrostecki@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Filipe's analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>


# c41ec452 21-Jan-2021 Su Yue <l@damenly.su>

btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch

This effectively reverts commit d5c8238849e7 ("btrfs: convert
data_seqcount to seqcount_mutex_t").

While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):

[66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5
[66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
[66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
[66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
[66.441485] Call Trace:
[66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs]
[66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
[66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs]
[66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441619] ? __this_cpu_preempt_check+0xf/0x11
[66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs]
[66.441700] ? __lock_acquire+0x35f/0x2650
[66.441717] ? lockdep_hardirqs_on+0x87/0x120
[66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
[66.441724] ? call_rcu+0x2d3/0x530
[66.441731] ? __might_fault+0x41/0x90
[66.441736] ? kvm_sched_clock_read+0x15/0x50
[66.441740] ? sched_clock+0x8/0x10
[66.441745] ? sched_clock_cpu+0x13/0x180
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441768] __ia32_sys_ioctl+0x165/0x8a0
[66.441773] ? __this_cpu_preempt_check+0xf/0x11
[66.441785] ? __might_fault+0x89/0x90
[66.441791] __do_fast_syscall_32+0x54/0x80
[66.441796] do_fast_syscall_32+0x32/0x70
[66.441801] do_SYSENTER_32+0x15/0x20
[66.441805] entry_SYSENTER_32+0x9f/0xf2
[66.441808] EIP: 0xab7b5549
[66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
[66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
[66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[66.441833] irq event stamp: 42579
[66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
[66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
[66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
[66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60

========================================================================
btrfs_remove_chunk+0x58b/0x7b0:
__seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
(inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
(inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
========================================================================

The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.

After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init(). In this scenario, its second
parameter info->chunk_mutex is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.

The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.

Commit d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5b316468 10-Nov-2020 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: get zone information of zoned block devices

If a zoned block device is found, get its zone information (number of
zones and zone size). To avoid costly run-time zone report
commands to test the device zones type during block allocation, attach
the seq_zones bitmap to the device structure to indicate if a zone is
sequential or accept random writes. Also it attaches the empty_zones
bitmap to indicate if a zone is empty or not.

This patch also introduces the helper function btrfs_dev_is_sequential()
to test if the zone storing a block is a sequential write required zone
and btrfs_dev_is_empty_zone() to test if the zone is a empty zone.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b2598edf 02-Nov-2020 Anand Jain <anand.jain@oracle.com>

btrfs: remove unused argument seed from btrfs_find_device

Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
check if the parameter seed is true in the function btrfs_find_device().
This tells it whether to traverse the seed device list or not.

After this commit, the argument is unused and can be removed.

In device_list_add() it's not necessary because fs_devices always points
to the device's fs_devices. So with the devid+uuid matching, it will
find the right device and return, thus not needing to traverse seed
devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bacce86a 06-Nov-2020 Anand Jain <anand.jain@oracle.com>

btrfs: drop unused argument step from btrfs_free_extra_devids

Commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have
replace item with target device") dropped the multi stage operation of
btrfs_free_extra_devids() that does not need to check replace target
anymore and we can remove the 'step' argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 33fd2f71 28-Oct-2020 Anand Jain <anand.jain@oracle.com>

btrfs: create read policy framework

As of now, we use the pid method to read striped mirrored data, which
means process id determines the stripe id to read. This type of routing
typically helps in a system with many small independent processes tying
to read random data. On the other hand, the pid based read IO policy is
inefficient because if there is a single process trying to read a large
file, the overall disk bandwidth remains underutilized.

So this patch introduces a read policy framework so that we could add
more read policies, such as IO routing based on the device's wait-queue
or manual when we have a read-preferred device or a policy based on the
target storage caching.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d5c82388 21-Oct-2020 Davidlohr Bueso <dave@stgolabs.net>

btrfs: convert data_seqcount to seqcount_mutex_t

By doing so we can associate the sequence counter to the chunk_mutex
for lockdep purposes (compiled-out otherwise), the mutex is otherwise
used on the write side.
Also avoid explicitly disabling preemption around the write region as it
will now be done automatically by the seqcount machinery based on the
lock type.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 66d204a1 12-Oct-2020 Filipe Manana <fdmanana@suse.com>

btrfs: fix readahead hang and use-after-free after removing a device

Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

[162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
[162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
[162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
[162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
[162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
[162513.514751] Call Trace:
[162513.514761] __schedule+0x5ce/0xd00
[162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.514771] schedule+0x46/0xf0
[162513.514844] wait_current_trans+0xde/0x140 [btrfs]
[162513.514850] ? finish_wait+0x90/0x90
[162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
[162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
[162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
[162513.514894] kthread+0x153/0x170
[162513.514897] ? kthread_stop+0x2c0/0x2c0
[162513.514902] ret_from_fork+0x22/0x30
[162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
[162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
[162513.515682] Call Trace:
[162513.515688] __schedule+0x5ce/0xd00
[162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.515697] schedule+0x46/0xf0
[162513.515712] wait_current_trans+0xde/0x140 [btrfs]
[162513.515716] ? finish_wait+0x90/0x90
[162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
[162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
[162513.515761] iterate_supers+0x87/0xf0
[162513.515765] ksys_sync+0x60/0xb0
[162513.515768] __do_sys_sync+0xa/0x10
[162513.515771] do_syscall_64+0x33/0x80
[162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.515781] RIP: 0033:0x7f5238f50bd7
[162513.515782] Code: Bad RIP value.
[162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
[162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
[162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
[162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
[162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
[162513.516620] Call Trace:
[162513.516625] __schedule+0x5ce/0xd00
[162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.516634] schedule+0x46/0xf0
[162513.516647] wait_current_trans+0xde/0x140 [btrfs]
[162513.516650] ? finish_wait+0x90/0x90
[162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
[162513.516686] __vfs_setxattr+0x66/0x80
[162513.516691] __vfs_setxattr_noperm+0x70/0x200
[162513.516697] vfs_setxattr+0x6b/0x120
[162513.516703] setxattr+0x125/0x240
[162513.516709] ? lock_acquire+0xb1/0x480
[162513.516712] ? mnt_want_write+0x20/0x50
[162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
[162513.516723] ? preempt_count_add+0x49/0xa0
[162513.516725] ? __sb_start_write+0x19b/0x290
[162513.516727] ? preempt_count_add+0x49/0xa0
[162513.516732] path_setxattr+0xba/0xd0
[162513.516739] __x64_sys_setxattr+0x27/0x30
[162513.516741] do_syscall_64+0x33/0x80
[162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.516745] RIP: 0033:0x7f5238f56d5a
[162513.516746] Code: Bad RIP value.
[162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
[162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
[162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
[162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
[162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
[162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
[162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
[162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
[162513.517780] Call Trace:
[162513.517786] __schedule+0x5ce/0xd00
[162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.517796] schedule+0x46/0xf0
[162513.517810] wait_current_trans+0xde/0x140 [btrfs]
[162513.517814] ? finish_wait+0x90/0x90
[162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
[162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
[162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
[162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
[162513.517865] iterate_supers+0x87/0xf0
[162513.517869] ksys_sync+0x60/0xb0
[162513.517872] __do_sys_sync+0xa/0x10
[162513.517875] do_syscall_64+0x33/0x80
[162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.517881] RIP: 0033:0x7f5238f50bd7
[162513.517883] Code: Bad RIP value.
[162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
[162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
[162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
[162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
[162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
[162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
[162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
[162513.519160] Call Trace:
[162513.519165] __schedule+0x5ce/0xd00
[162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.519174] schedule+0x46/0xf0
[162513.519190] wait_current_trans+0xde/0x140 [btrfs]
[162513.519193] ? finish_wait+0x90/0x90
[162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
[162513.519222] btrfs_create+0x57/0x200 [btrfs]
[162513.519230] lookup_open+0x522/0x650
[162513.519246] path_openat+0x2b8/0xa50
[162513.519270] do_filp_open+0x91/0x100
[162513.519275] ? find_held_lock+0x32/0x90
[162513.519280] ? lock_acquired+0x33b/0x470
[162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
[162513.519287] ? _raw_spin_unlock+0x29/0x40
[162513.519295] do_sys_openat2+0x20d/0x2d0
[162513.519300] do_sys_open+0x44/0x80
[162513.519304] do_syscall_64+0x33/0x80
[162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.519309] RIP: 0033:0x7f5238f4a903
[162513.519310] Code: Bad RIP value.
[162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
[162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
[162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
[162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
[162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
[162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
[162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
[162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
[162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
[162513.520511] Call Trace:
[162513.520516] __schedule+0x5ce/0xd00
[162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[162513.520525] schedule+0x46/0xf0
[162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
[162513.520548] ? finish_wait+0x90/0x90
[162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
[162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
[162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
[162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
[162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
[162513.520643] ? do_sigaction+0xf3/0x240
[162513.520645] ? find_held_lock+0x32/0x90
[162513.520648] ? do_sigaction+0xf3/0x240
[162513.520651] ? lock_acquired+0x33b/0x470
[162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
[162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
[162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
[162513.520662] ? do_sigaction+0xf3/0x240
[162513.520671] ? __x64_sys_ioctl+0x83/0xb0
[162513.520672] __x64_sys_ioctl+0x83/0xb0
[162513.520677] do_syscall_64+0x33/0x80
[162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[162513.520681] RIP: 0033:0x7fc3cd307d87
[162513.520682] Code: Bad RIP value.
[162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
[162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
[162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
[162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
[162513.520703]
Showing all locks held in the system:
[162513.520712] 1 lock held by khungtaskd/54:
[162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
[162513.520728] 1 lock held by in:imklog/596:
[162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
[162513.520782] 1 lock held by btrfs-transacti/1356167:
[162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
[162513.520798] 1 lock held by btrfs/1356190:
[162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
[162513.520805] 1 lock held by fsstress/1356184:
[162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520811] 3 locks held by fsstress/1356185:
[162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
[162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520833] 1 lock held by fsstress/1356196:
[162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
[162513.520838] 3 locks held by fsstress/1356197:
[162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
[162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
[162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
[162513.520858] 2 locks held by btrfs/1356211:
[162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
[162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

>>> t = find_task(prog, 1356190)
>>> prog.stack_trace(t)
#0 __schedule+0x5ce/0xcfc
#1 schedule+0x46/0xe4
#2 schedule_timeout+0x1df/0x475
#3 btrfs_reada_wait+0xda/0x132
#4 scrub_stripe+0x2a8/0x112f
#5 scrub_chunk+0xcd/0x134
#6 scrub_enumerate_chunks+0x29e/0x5ee
#7 btrfs_scrub_dev+0x2d5/0x91b
#8 btrfs_ioctl+0x7f5/0x36e7
#9 __x64_sys_ioctl+0x83/0xb0
#10 do_syscall_64+0x33/0x77
#11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
struct reada_control *rc = handle;
struct btrfs_fs_info *fs_info = rc->fs_info;

while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
reada_start_machine(fs_info);
wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
(HZ + 9) / 10);
}
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

$ cat dump_reada.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass

if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)

fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

def dump_re(re):
nzones = re.nzones.value_()
print(f're at {hex(re.value_())}')
print(f'\t logical {re.logical.value_()}')
print(f'\t refcnt {re.refcnt.value_()}')
print(f'\t nzones {nzones}')
for i in range(nzones):
dev = re.zones[i].device
name = dev.name.str.string_()
print(f'\t\t dev id {dev.devid.value_()} name {name}')
print()

for _, e in radix_tree_for_each(fs_info.reada_tree):
re = cast('struct reada_extent *', e)
dump_re(re)

$ drgn dump_reada.py
re at 0xffff8f3da9d25ad8
logical 38928384
refcnt 1
nzones 1
dev id 0 name b'/dev/sdd'
$

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

$ cat dump_block_groups.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass

if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)

fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

BTRFS_BLOCK_GROUP_DATA = (1 << 0)
BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
BTRFS_BLOCK_GROUP_DUP = (1 << 5)
BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

def bg_flags_string(bg):
flags = bg.flags.value_()
ret = ''
if flags & BTRFS_BLOCK_GROUP_DATA:
ret = 'data'
if flags & BTRFS_BLOCK_GROUP_METADATA:
if len(ret) > 0:
ret += '|'
ret += 'meta'
if flags & BTRFS_BLOCK_GROUP_SYSTEM:
if len(ret) > 0:
ret += '|'
ret += 'system'
if flags & BTRFS_BLOCK_GROUP_RAID0:
ret += ' raid0'
elif flags & BTRFS_BLOCK_GROUP_RAID1:
ret += ' raid1'
elif flags & BTRFS_BLOCK_GROUP_DUP:
ret += ' dup'
elif flags & BTRFS_BLOCK_GROUP_RAID10:
ret += ' raid10'
elif flags & BTRFS_BLOCK_GROUP_RAID5:
ret += ' raid5'
elif flags & BTRFS_BLOCK_GROUP_RAID6:
ret += ' raid6'
elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
ret += ' raid1c3'
elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
ret += ' raid1c4'
else:
ret += ' single'

return ret

def dump_bg(bg):
print()
print(f'block group at {hex(bg.value_())}')
print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

bg_root = fs_info.block_group_cache_tree.address_of_()
for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
dump_bg(bg)

$ drgn dump_block_groups.py

block group at 0xffff8f3d673b0400
start 22020096 length 16777216
flags 258 - system raid6

block group at 0xffff8f3d53ddb400
start 38797312 length 536870912
flags 260 - meta raid6

block group at 0xffff8f3d5f4d9c00
start 575668224 length 2147483648
flags 257 - data raid6

block group at 0xffff8f3d08189000
start 2723151872 length 67108864
flags 258 - system raid6

block group at 0xffff8f3db70ff000
start 2790260736 length 1073741824
flags 260 - meta raid6

block group at 0xffff8f3d5f4dd800
start 3864002560 length 67108864
flags 258 - system raid6

block group at 0xffff8f3d67037000
start 3931111424 length 2147483648
flags 257 - data raid6
$

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
starting to scrub a stripe from block group X. This call to
btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
calls reada_add_block(), it passes the logical address of the extent
tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
It finds there isn't any existing readahead extent for the logical
address 38928384, so it proceeds to the path of creating a new one.

It calls btrfs_map_block() to find out which stripes exist for the block
group X. On the first iteration of the for loop that iterates over the
stripes, it finds the stripe for device /dev/sdd, so it creates one
zone for that device and adds it to the readahead extent. Before getting
into the second iteration of the loop, the cleanup kthread deletes block
group X because it was empty. So in the iterations for the remaining
stripes it does not add more zones to the readahead extent, because the
calls to reada_find_zone() returned NULL because they couldn't find
block group X anymore.

As a result the new readahead extent has a single zone, corresponding to
the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
for the readahead work queue, task B finishes the device replace and at
btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
"->elems" of the reada_control structure allocated at btrfs_reada_add().

Then it returns back to btrfs_reada_add() and calls
reada_start_machine(). This queues a job in the readahead work queue to
run the function reada_start_machine_worker(), which calls
__reada_start_machine().

At __reada_start_machine() we take the device list mutex and for each
device found in the current device list, we call
reada_start_machine_dev() to start the readahead work. However at this
point the device /dev/sdd was already freed and is not in the device
list anymore.

This means the corresponding readahead for the extent at 38928384 is
never started, and therefore the "->elems" counter of the reada_control
structure allocated at btrfs_reada_add() never goes down to 0, causing
the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
and the corresponding btrfs_device structure is freed. However the
readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
__readahead_hook() which ends up dropping the last reference on the
readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
and attempts to delete an element from the device's 'reada_extents'
radix tree, resulting in a use-after-free, as the device pointer of the
zone for /dev/sdd is now stale. We can also access the device after
dropping the last reference of a zone, through reada_zone_release(),
also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8d1a7aae 04-Dec-2019 Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com>

btrfs: annotate device name rcu_string with __rcu

This patch fixes the following sparse errors in
fs/btrfs/super.c in function btrfs_show_devname()

fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces):
fs/btrfs/super.c: struct rcu_string [noderef] <asn:4> *
fs/btrfs/super.c: struct rcu_string *

The error was because of the following line in function btrfs_show_devname():

if (first_dev)
seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\");

Annotating the btrfs_device::name member with __rcu fixes the sparse
error.

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 944d3f9f 16-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: switch seed device to list api

While this patch touches a bunch of files the conversion is
straighforward. Instead of using the implicit linked list anchored at
btrfs_fs_devices::seed the code is switched to using
list_for_each_entry.

Previous patches in the series already factored out code that processed
both main and seed devices so in those cases the factored out functions
are called on the main fs_devices and then on every seed dev inside
list_for_each_entry.

Using list api also allows to simplify deletion from the seed dev list
performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
substituting a while() loop with a simple list_del_init.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4989c2f 15-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: simplify setting/clearing fs_info to btrfs_fs_devices

It makes no sense to have sysfs-related routines be responsible for
properly initialising the fs_info pointer of struct btrfs_fs_device.
Instead this can be streamlined by making it the responsibility of
btrfs_init_devices_late to initialize it. That function already
initializes fs_info of every individual device in btrfs_fs_devices.

As far as clearing it is concerned it makes sense to move it to
close_fs_devices. That function is only called when struct
btrfs_fs_devices is no longer in use - either for holding seeds or
main devices for a mounted filesystem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 54eed6ae 15-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make close_fs_devices return void

The return value of this function conveys absolutely no information.
All callers already check the state of fs_devices->opened to decide how
to proceed. So convert the function to returning void. While at it make
btrfs_close_devices also return void.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 313b0858 20-Aug-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing

We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed. Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c31efbdf 03-Jul-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: record btrfs_device directly in btrfs_io_bio

Instead of recording stripe_index and using that to access correct
btrfs_device from btrfs_bio::stripes record the btrfs_device in
btrfs_io_bio. This will enable endio handlers to increment device
error counters on checksum errors.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c730ae0c 16-Jun-2020 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: convert comments to fallthrough annotations

Convert fall through comments to the pseudo-keyword which is now the
preferred way.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4a816c6 24-Feb-2020 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: introduce chunk allocation policy

Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 97f4dd09 18-Feb-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_check_uuid_tree private to disk-io.c

It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8f32380d 13-Feb-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: use the page cache for super block reading

Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.

Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae44 ("btrfs: access
superblock via pagecache in scan_one_device").

This patch unwraps the buffer head API and does not change the way the
super block is actually read.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6fbceb9f 13-Feb-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: reduce scope of btrfs_scratch_superblocks()

btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static. Also move it
above it's callers so we don't need a forward declaration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f6d9abbc 13-Feb-2020 Nikolay Borisov <nborisov@suse.com>

btrfs: Export btrfs_release_disk_super

Preparatory patch for removal of buffer_head usage in btrfs.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a013d141 12-Feb-2020 Anand Jain <anand.jain@oracle.com>

btrfs: sysfs, add UUID/devinfo kobject

Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 668e48af 06-Jan-2020 Anand Jain <anand.jain@oracle.com>

btrfs: sysfs, add devid/dev_state kobject and device attributes

New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.

in_fs_metadata - device is in the list of fs metadata
missing - device is missing (no device node or block device)
replace_target - device is target of replace
writeable - writes from fs are allowed

These attributes reflect the state of the device::dev_state and created
at mount time.

Sample output:
$ pwd
/sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
$ ls
in_fs_metadata missing replace_target writeable
$ cat missing
0

The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set. These attributes are readonly.

It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.

Note for device replace devid swap:

During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.

In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs. This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 96a14336 10-Dec-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Move and unexport btrfs_rmap_block

It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94f8c465 06-Jan-2020 David Sterba <dsterba@suse.com>

btrfs: remove unused member btrfs_device::work

This is a leftover from recently removed bio scheduling framework.

Fixes: ba8a9d079543 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# b5501504 21-Nov-2019 Anand Jain <anand.jain@oracle.com>

btrfs: sysfs, rename devices kobject holder to devices_kobj

The struct member btrfs_device::device_dir_kobj holds the kobj of the
sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from
device_dir_kobj to devices_kobj. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 94545870 15-Oct-2019 Thomas Gleixner <tglx@linutronix.de>

sched/rt, btrfs: Use CONFIG_PREEMPTION

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7f0432d0 13-Nov-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: change btrfs_fs_devices::rotating to bool

struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0395d84f 13-Nov-2019 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: change btrfs_fs_devices::seeding to bool

struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8d6fac00 02-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: add support for 4-copy replication (raid1c4)

Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.

Signed-off-by: David Sterba <dsterba@suse.com>


# 47e6f742 02-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: add support for 3-copy replication (raid1c3)

Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.

Signed-off-by: David Sterba <dsterba@suse.com>


# 34b127ae 23-Oct-2019 Qu Wenruo <wqu@suse.com>

btrfs: Remove btrfs_bio::flags member

The last user of btrfs_bio::flags was removed in commit 326e1dbb5736
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.

(Tagged for stable as the structure is heavily used and space savings
are desirable.)

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ba8a9d07 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: delete the entire async bio submission framework

Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 08635bae 10-Jul-2019 Chris Mason <clm@fb.com>

Btrfs: stop using btrfs_schedule_bio()

btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.

All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4143cb8b 01-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: add const function attribute

For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4e411a7d 07-Aug-2019 Anand Jain <anand.jain@oracle.com>

btrfs: reset device stat using btrfs_dev_stat_set

btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# adf4c0c5 01-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: remove unused btrfs_device::flush_bio_sent

The status of flush bio is tracked as a status bit, changed in commit
1c3063b6dbfa ("btrfs: cleanup device states define
BTRFS_DEV_STATE_FLUSH_SENT"), the flush_bio_sent was forgotten.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9e3246a5 19-Jul-2019 Qu Wenruo <wqu@suse.com>

btrfs: volumes: Unexport find_free_dev_extent_start()

This function is only used locally in find_free_dev_extent(), no
external callers.

So unexport it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 89b798ad 02-Jun-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Use btrfs_get_io_geometry appropriately

Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.

This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5f141126 02-Jun-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Introduce btrfs_io_geometry infrastructure

Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0b6f5d40 09-May-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Add comments on locking of several device-related fields

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 158da513 17-May-2019 David Sterba <dsterba@suse.com>

btrfs: refactor helper for bg flags to name conversion

The helper lacks the btrfs_ prefix and the parameter is the raw
blockgroup type, so none of the callers has to do the flags -> index
conversion.

Signed-off-by: David Sterba <dsterba@suse.com>


# 8c3e3582 17-May-2019 David Sterba <dsterba@suse.com>

btrfs: use u8 for raid_array members

The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just
small numbers so we don't have to use ints. New size is 7 * 32 = 224,
saving 3 cachelines.

Signed-off-by: David Sterba <dsterba@suse.com>


# c8bf1b67 17-May-2019 David Sterba <dsterba@suse.com>

btrfs: remove mapping tree structures indirection

fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.

Signed-off-by: David Sterba <dsterba@suse.com>


# 65237ee3 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev

We can read fs_info from the device and can drop it from the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 196c9d8d 20-Mar-2019 David Sterba <dsterba@suse.com>

btrfs: get fs_info from trans in btrfs_run_dev_stats

We can read fs_info from the transaction and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>


# 60dfdf25 27-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove 'trans' argument from find_free_dev_extent(_start)

Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 1c11b63e 27-Mar-2019 Jeff Mahoney <jeffm@suse.com>

btrfs: replace pending/pinned chunks lists with io tree

The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.

The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.

The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.

This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8e75fd89 27-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Stop using call_rcu for device freeing

btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.

This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# bbbf7243 25-Mar-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: combine device update operations during transaction commit

We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.

We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ab4ba2e1 07-Mar-2019 Qu Wenruo <wqu@suse.com>

btrfs: tree-checker: Verify dev item

[BUG]
For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then
kernel will just panic:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
#PF error: [normal kernel read fault]
PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
Call Trace:
open_ctree+0x160d/0x2149
btrfs_mount_root+0x5b2/0x680

[CAUSE]
If device extent verification finds a deivce with 0 total_bytes, then it
assumes it's a seed dummy, then search for seed devices.

But in this case, there is no seed device at all, causing NULL pointer.

[FIX]
Since this is caused by fuzzed image, let's go the tree-check way, just
add a new verification for device item.

Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 0a4c9265 23-Jan-2019 Gustavo A. R. Silva <gustavo@embeddedor.com>

fs: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warnings:

fs/affs/affs.h:124:38: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1692:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1694:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ceph/file.c:249:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:618:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:620:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/btrfs/ref-verify.c:522:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:711:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:722:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/jffs2/fs.c:339:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/nfsd/nfs4proc.c:429:12: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:62:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:43:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/fcntl.c:770:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/seq_file.c:319:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:148:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:150:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/signalfd.c:178:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/locks.c:1473:16: warning: this statement may fall through [-Wimplicit-fallthrough=]

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>


# 228a73ab 03-Jan-2019 Anand Jain <anand.jain@oracle.com>

btrfs: introduce new ioctl to unregister a btrfs device

Support for a new command that can be used eg. as a command

$ btrfs device scan --forget [dev]'
(the final name may change though)

to undo the effects of 'btrfs device scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty and is next to the
SCAN ioctl.

The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
(/dev/btrfs-control) to unregister one or all devices, devices that are
not mounted.

The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
path. To unregister all device, the path is an empty string.

Again, the devices are removed only if they aren't part of a mounte
filesystem.

This new ioctl provides:

- release of unwanted btrfs_fs_devices and btrfs_devices structures
from memory if the device is not going to be mounted

- ability to mount filesystem in degraded mode, when one devices is
corrupted like in split brain raid1

- running test cases which would require reloading the kernel module
but this is not possible eg. due to mounted filesystem or built-in

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 09ba3bc9 18-Jan-2019 Anand Jain <anand.jain@oracle.com>

btrfs: merge btrfs_find_device and find_device

Both btrfs_find_device() and find_device() does the same thing except
that the latter does not take the seed device onto account in the device
scanning context. We can merge them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e4319cd9 17-Jan-2019 Anand Jain <anand.jain@oracle.com>

btrfs: refactor btrfs_find_device() take fs_devices as argument

btrfs_find_device() accepts fs_info as an argument and retrieves
fs_devices from fs_info.

Instead use fs_devices, so that this function can be used in non-mount
(during device scanning) context as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7b41ba71 23-Nov-2018 Johannes Thumshirn <jthumshirn@suse.de>

btrfs: remove btrfs_bio_end_io_t

The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.

Remove both as they're not needed anywhere.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# b3a0dd50 22-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: replace btrfs_io_bio::end_io with a simple helper

The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.

This shrinks struct btrfs_io_bio by 8 bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# 31fecccb 22-Nov-2018 David Sterba <dsterba@suse.com>

btrfs: remove redundant csum buffer in btrfs_io_bio

The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.

This shrinks struct btrfs_io_bio by 8 bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>


# f89e09cf 20-Nov-2018 Anand Jain <anand.jain@oracle.com>

btrfs: add helper to describe block group flags

Factor out helper that describes block group flags from
describe_relocation. The result will not be longer than the given size.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# d1a63002 30-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: add members to fs_devices to track fsid changes

In order to gracefully handle split-brain scenario during fsid change
(which are very unlikely, yet possible), two more pieces of information
will be necessary:

1. The highest generation number among all devices registered to a
particular btrfs_fs_devices

2. A boolean flag whether a given btrfs_fs_devices was created by a
device which had the FSID_CHANGING_V2 flag set.

This is a preparatory patch and just introduces the variables as well
as code which sets them, their actual use is going to happen in a later
patch.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 7239ff4b 30-Oct-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Introduce support for FSID change without metadata rewrite

This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:

1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.

When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# b50836ed 04-Oct-2018 Hans van Kranenburg <hans.van.kranenburg@mendix.com>

btrfs: introduce nparity raid_attr

Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 60ca842e 16-May-2018 Omar Sandoval <osandov@fb.com>

Btrfs: rename and export get_chunk_map

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a27a94c2 02-Sep-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly

Instead of returning an error value and using one of the parameters for
returning the actual object we are interested in just refactor the
function to directly return btrfs_device *. Also bubble up the error
handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
btrfs_rm_device. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6c050407 02-Sep-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Make btrfs_find_device_missing_or_by_path return directly a device

This function returns a numeric error value and additionally the
device found in one of its input parameters. Simplify this by making
the function directly return a pointer to btrfs_device. Additionally
adjust the caller to handle the case when we want to remove the
'missing' device and ENOENT is returned to return the expected
positive error value, parsed by progs. Finally, unexport the function
since it's not called outside of volume.c. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cf90d884 31-Jul-2018 Qu Wenruo <wqu@suse.com>

btrfs: Introduce mount time chunk <-> dev extent mapping check

This patch will introduce chunk <-> dev extent mapping check, to protect
us against invalid dev extents or chunks.

Since chunk mapping is the fundamental infrastructure of btrfs, extra
check at mount time could prevent a lot of unexpected behavior (BUG_ON).

Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 97aff912 20-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_finish_chunk_alloc

It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4f5ad7bd 20-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdev

This function is always passed a well-formed tgtdevice so the fs_info
can be referenced from there.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d6507cf1 20-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_assign_next_active_device

It can be referenced from the passed 'device' argument which is always
a well-formed device.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 68a9db5f 20-Jul-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdev

It can be referenced from the passed srcdev argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 46df06b8 13-Jul-2018 David Sterba <dsterba@suse.com>

btrfs: refactor block group replication factor calculation to a helper

There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.

Signed-off-by: David Sterba <dsterba@suse.com>


# 36350e95 12-Jul-2018 Gu Jinxiang <gujx@cn.fujitsu.com>

btrfs: return device pointer from btrfs_scan_one_device

Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.

And since btrfs_fs_devices can be obtained from btrfs_device, return that.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 621567a2 09-Jul-2018 Qu Wenruo <wqu@suse.com>

btrfs: Remove unused function btrfs_account_dev_extents_size

This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317207082856 ("btrfs: obsolete and remove
mount option alloc_start").

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fce466ea 03-Jul-2018 Qu Wenruo <wqu@suse.com>

btrfs: tree-checker: Verify block_group_item

A crafted image with invalid block group items could make free space cache
code to cause panic.

We could detect such invalid block group item by checking:
1) Item size
Known fixed value.
2) Block group size (key.offset)
We have an upper limit on block group item (10G)
3) Chunk objectid
Known fixed value.
4) Type
Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
No more than 1 bit set for profile type.
5) Used space
No more than the block group size.

This should allow btrfs to detect and refuse to mount the crafted image.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c216b203 20-Jun-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info from btrfs_alloc_chunk

It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6fcf6e2b 07-May-2018 David Sterba <dsterba@suse.com>

btrfs: remove redundant btrfs_balance_control::fs_info

The fs_info is always available from the context so we don't need to
store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 63a9c7b9 04-May-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove devid parameter from btrfs_rmap_block

This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# f9fbcaa2 25-Apr-2018 Anand Jain <anand.jain@oracle.com>

btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table

Add a new member struct btrfs_raid_attr::mindev_error so that
btrfs_raid_array can maintain the error code to return if the minimum
number of devices condition is not met while trying to delete a device
in the given raid. And so we can drop btrfs_raid_mindev_error.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 41a6e891 25-Apr-2018 Anand Jain <anand.jain@oracle.com>

btrfs: move btrfs_raid_group values to btrfs_raid_attr table

Add a new member struct btrfs_raid_attr::bg_flag so that
btrfs_raid_array can maintain the bit map flag of the raid type, and
so we can drop btrfs_raid_group.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ed23467b 25-Apr-2018 Anand Jain <anand.jain@oracle.com>

btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table

Add a new member struct btrfs_raid_attr::raid_name so that
btrfs_raid_array can maintain the name of the raid type, and so we can
drop btrfs_raid_type_names.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# d48f39d5 20-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static

The function logically belongs there and there's only a single caller,
no need to export it. No code changes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# a425f9d4 20-Mar-2018 David Sterba <dsterba@suse.com>

btrfs: export and rename free_device

The function will be used outside of volumes.c, the allocation
btrfs_alloc_device is also exported.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c4babc5e 11-Apr-2018 Anand Jain <anand.jain@oracle.com>

btrfs: rename struct btrfs_fs_devices::list

btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic
name 'list' makes it's search very difficult, rename it to fs_list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9888c340 03-Apr-2018 David Sterba <dsterba@suse.com>

btrfs: replace GPL boilerplate by SPDX -- headers

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>


# 9b99b115 26-Feb-2018 Anand Jain <anand.jain@oracle.com>

btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids

This function btrfs_close_extra_devices() is about freeing
extra devids which once it may have belonged to this filesystem.
So rename it and add the comment. The _devid suffix is
appropriate as this function won't handle devices which are
outside of the filesytem being mounted.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ffc5a379 19-Feb-2018 David Sterba <dsterba@suse.com>

btrfs: add (the only possible) __exit annotation

Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.

As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.

Signed-off-by: David Sterba <dsterba@suse.com>


# 3e72ee88 30-Jan-2018 Qu Wenruo <wqu@suse.com>

btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()

Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.

Although the name is a little longer, but it should explain its usage
quite well.

2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.

This also makes later code re-use between kernel and btrfs-progs
easier.

3) Remove function get_block_group_index()
Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e9b919b1 07-Feb-2018 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used

We already pass the btrfs_transaction which references fs_info so no
need to pass the later as an argument. Also use the opportunity to
shorten transaction->trans. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 15fc1283 12-Feb-2018 Anand Jain <anand.jain@oracle.com>

btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume()

btrfs_init_dev_replace_tgtdev_for_resume() initializes replace
target device in a few simple steps, so do it at the parent function.
Moreover, there isn't any other caller so just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3cbf26da 17-Jan-2018 Matthew Wilcox <willy@infradead.org>

btrfs: Remove unused readahead spinlock

The reada_lock in struct btrfs_device was only initialised, and not
actually used. That's good because there's another lock also called
reada_lock in the btrfs_fs_info that was quite heavily used. Remove
this one.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cadbc0a0 03-Jan-2018 Anand Jain <anand.jain@oracle.com>

btrfs: rename btrfs_device::scrub_device to scrub_ctx

btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6528b99d 18-Dec-2017 Anand Jain <Anand.Jain@oracle.com>

btrfs: factor btrfs_check_rw_degradable() to check given device

Update btrfs_check_rw_degradable() to check against the given device if
its lost.

We can use this function to know if the volume is going to be in
degraded mode OR failed state, when the given device fails. Which is
needed when we are handling the device failed state.

A preparatory patch does not affect the flow as such.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ enhance comment ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 1c3063b6 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT

Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 401e29c1 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT

Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_MISSING and use the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e6e674bd 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING

Currently device state is being managed by each individual int
variable such as struct btrfs_device::missing. Instead of that
declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use
the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by : Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e12c9621 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA

Currently device state is being managed by each individual int
variable such as struct btrfs_device::in_fs_metadata. Instead of
that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# ebbede42 03-Dec-2017 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE

Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 38b5f68e 29-Nov-2017 Anand Jain <anand.jain@oracle.com>

btrfs: drop btrfs_device::can_discard to query directly

We can query the bdev directly when needed at btrfs_discard_extent()
so drop btrfs_device::can_discard.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ac244ef1 20-Oct-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused function

It's sole callsite was removed in a previous patch so just nuke it for good.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 4660c49f 20-Oct-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove redundant memory barrier in dev stats

As per atomic_t.txt documentation :
- RMW operations that have a return value are fully ordered;

atomic_xchg is one such operation so it already includes everything it
needs w.r.t memory ordering and add a comment to be more explicit about
that.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9deae968 24-Oct-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Fix memory barriers usage with device stats counters

Commit addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev
stats is cleared") reworked the way device stats changes are tracked. A
new atomic dev_stats_ccnt counter was introduced which is incremented
every time any of the device stats counters are changed. This serves as
a flag whether there are any pending stats changes. However, this patch
only partially implemented the correct memory barriers necessary:

- It only ordered the stores to the counters but not the reads e.g.
btrfs_run_dev_stats
- It completely omitted any comments documenting the intended design and
how the memory barriers pair with each-other

This patch provides the necessary comments as well as adds a missing
smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_cnt is only
a snapshot at best there was no point in reading the counter twice -
once in btrfs_dev_stats_dirty and then again when assigning stats_cnt.
Just collapse both reads into 1.

Fixes: addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 9f5316c1 23-Oct-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: free btrfs_device in place

It's pointless to defer it to a kthread helper as we're not under a
special context.

For reference, commit 1f78160ce1b1 ("Btrfs: using rcu lock in the reader
side of devices list") introduced RCU freeing for device structures.

Originally the blkdev_put was called from free_device and rcu_barrier had
to be called. This is no longer required, bdev and our device structures
are now freed separately.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 5a2b8e60 08-Oct-2017 Anand Jain <anand.jain@oracle.com>

btrfs: declare btrfs_report_missing_device() static

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 58efbc9f 23-Aug-2017 Omar Sandoval <osandov@fb.com>

Btrfs: fix blk_status_t/errno confusion

This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.

Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# db7c942c 16-Aug-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused sectorsize variable from struct map_lookup

This variable was added in 1abe9b8a138c ("Btrfs: add initial tracepointi
support for btrfs"), yet it never really got used, only assigned to. So
let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# e4ff5fb5 19-Jul-2017 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove unused parameters from volume.c functions

This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.

btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3abeb
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit

btrfs_is_parity_mirror's mirror_num - same as above

chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c4b ("Btrfs:
devid subset filter") and never used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# c5502451 08-Mar-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Enhance message when a device is missing during mount

For a missing device, btrfs will just refuse to mount with almost
meaningless kernel message like:

BTRFS info (device vdb6): disk space caching is enabled
BTRFS info (device vdb6): has skinny extents
BTRFS error (device vdb6): failed to read the system array: -5
BTRFS error (device vdb6): open_ctree failed

This patch will print a new message about the missing device:

BTRFS info (device vdb6): disk space caching is enabled
BTRFS info (device vdb6): has skinny extents
BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing
BTRFS error (device vdb6): failed to read the system array: -5
BTRFS error (device vdb6): open_ctree failed

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 21634a19 08-Mar-2017 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.

It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.

Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.

Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.

For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.

But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.

Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw

If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.

This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>


# e0ae9994 06-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: preallocate device flush bio

For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.

Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.

Signed-off-by: David Sterba <dsterba@suse.com>


# fa1bcbe0 12-Jun-2017 David Sterba <dsterba@suse.com>

btrfs: document mandatory order of bio in btrfs_io_bio

Signed-off-by: David Sterba <dsterba@suse.com>


# 17347cec 15-May-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: change how we iterate bios in endio

Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 401b41e5 05-May-2017 Anand Jain <anand.jain@oracle.com>

btrfs: add framework to handle device flush error as a volume

This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 13e88e15 05-Apr-2017 Anand Jain <anand.jain@oracle.com>

btrfs: delete unused member nobarriers

The last consumer of nobarriers is removed by the commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write
through it just return success without actually transpiring such a
request to the block device/lun.

[1]
commit b25de9d6da49b1a8760a89672283128aa8c78345
block: remove BIO_EOPNOTSUPP

And, as the device/lun write cache state may change dynamically saving
such as state won't help either. So deleting the member nobarriers.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 825ad4c9 28-Mar-2017 David Sterba <dsterba@suse.com>

btrfs: drop redundant parameters from btrfs_map_sblock

All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba <dsterba@suse.com>


# 592d92ee 14-Mar-2017 Liu Bo <bo.li.liu@oracle.com>

Btrfs: create a helper for getting chunk map

We have similar code here and there, this merges them into a helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 140475ae 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

btrfs: convert btrfs_bio.refs from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# da353f6b 14-Feb-2017 David Sterba <dsterba@suse.com>

btrfs: constify device path passed to relevant helpers

Signed-off-by: David Sterba <dsterba@suse.com>


# 34441361 04-Oct-2016 David Sterba <dsterba@suse.com>

btrfs: opencode chunk locking, remove helpers

The helpers are trivial and we don't use them consistently.

Signed-off-by: David Sterba <dsterba@suse.com>


# 2ff7e61e 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: take an fs_info directly when the root is not used otherwise

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 3796d335 16-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, lock/unlock_chunks

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# fb456252 22-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5112febb 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: btrfs_init_new_device should use fs_info->dev_root

btrfs_init_new_device only uses the root passed in via the ioctl to
start the transaction. Nothing else that happens is related to whatever
root the user used to initiate the ioctl. We can drop the root requirement
and just use fs_info->dev_root instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6bccf3ab 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: call functions that always use the same root with fs_info instead

There are many functions that are always called with the same root
argument. Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5b4aacef 21-Jun-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: call functions that overwrite their root parameter with fs_info

There are 11 functions that accept a root parameter and immediately
overwrite it. We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# cf8cddd3 27-Oct-2016 Christoph Hellwig <hch@lst.de>

btrfs: don't abuse REQ_OP_* flags for btrfs_map_block

btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 70fd7614 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block,fs: use REQ_* flags directly

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# ab8d0fc4 20-Sep-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: convert pr_* to btrfs_* where possible

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 81a75f67 05-Jun-2016 Mike Christie <mchristi@redhat.com>

btrfs: use bio fields for op and flags

The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# b3d3fa51 05-Jun-2016 Mike Christie <mchristi@redhat.com>

btrfs: update __btrfs_map_block for REQ_OP transition

We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block.
It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS,
so this drops the bit tests.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 3d8da678 26-Apr-2016 Liu Bo <bo.li.liu@oracle.com>

Btrfs: fix divide error upon chunk's stripe_len

The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
__btrfs_map_block().

This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 88acff64 03-May-2016 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup assigning next active device with a check

Creates helper fucntion as needed by the device delete
and replace operations. Also now it checks if the next
device being assigned is an active device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 04cd01df 01-Apr-2016 Jeff Mahoney <jeffm@suse.com>

btrfs: uapi/linux/btrfs.h migration, move balance flags

The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags
and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 5c5c0df0 15-Feb-2016 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_find_device_by_user_input

For clarity how we are going to find the device, let's call it a device
specifier, devspec for short. Also rename the arguments that are a
leftover from previous function purpose.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 621292ba 15-Feb-2016 David Sterba <dsterba@suse.com>

btrfs: introduce raid-type to error-code table, for minimum device constraint

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 6b526ed7 12-Feb-2016 Anand Jain <anand.jain@oracle.com>

btrfs: introduce device delete by devid

This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
btrfs_ioctl_vol_args_v2 to carry devid as an user argument.

The patch won't delete the old ioctl interface and so kernel remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del 3 /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
[ adjust messages, s/disk/device/ ]
Signed-off-by: David Sterba <dsterba@suse.com>


# 24e0474b 12-Feb-2016 Anand Jain <anand.jain@oracle.com>

btrfs: create helper btrfs_find_device_by_user_input()

The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c, so that
delete device can use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# ee22184b 14-Dec-2015 Byongho Lee <bhlee.kernel@gmail.com>

Btrfs: use linux/sizes.h to represent constants

We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 87ad58c5 27-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: make btrfs_close_one_device static

Signed-off-by: David Sterba <dsterba@suse.com>


# da02c689 16-Nov-2015 David Sterba <dsterba@suse.com>

btrfs: fix clashing number of the enhanced balance usage filter

I've accidentally picked an already used number for the enhanced usage
filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with
BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase,
no backward compatibility issues.

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum")
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# b66d62ba 16-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: add balance filters limits, stripes and usage to supported mask

Enable the extended 'limit' syntax (a range), the new 'stripes' and
extended 'usage' syntax (a range) filters in the filters mask. The patch
comes separate and not within the series that introduced the new filters
because the patch adding the mask was merged in a late rc. The
integration branch was based on an older rc and could not merge the
patch due to the missing changes.

Prerequisities:
* btrfs: check unsupported filters in balance arguments
* btrfs: extend balance filter limit to take minimum and maximum
* btrfs: add balance filter for stripes
* btrfs: extend balance filter usage to take minimum and maximum

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# bc309467 20-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: extend balance filter usage to take minimum and maximum

Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# dee32d0a 28-Sep-2015 Gabríel Arthúr Pétursson <gabriel@system.is>

btrfs: add balance filter for stripes

Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.

Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>


# 12907fc7 10-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: extend balance filter limit to take minimum and maximum

The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.

Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 849ef928 12-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: check unsupported filters in balance arguments

We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8789f4fe 15-Sep-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures()

btrfs_raid_array[] is used to define all raid attributes, use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of complex condition in function.

It can make code simple and auto-support other possible raid-type in
future.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# af902047 15-Sep-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

btrfs: Move btrfs_raid_array to public

This array is used to record attributes of each raid type,
make it public, and many functions will benifit with this array.

For example, num_tolerated_disk_barrier_failures(), we can
avoid complex conditions in this function, and get raid attribute
simply by accessing above array.

It can also make code logic simple, reduce duplication code, and
increase maintainability.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8eb93459 12-Oct-2015 David Sterba <dsterba@suse.com>

btrfs: check unsupported filters in balance arguments

We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# f190aa47 14-Aug-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: add helper for closing one device

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>


# 12b1c263 14-Aug-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks

This patch updates and renames btrfs_scratch_superblocks, (which is used
by the replace device thread), with those fixes from the scratch
superblock code section of btrfs_rm_device(). The fixes are:
Scratch all copies of superblock
Notify kobject that superblock has been changed
Update time on the device

So that btrfs_rm_device() can use the function
btrfs_scratch_superblocks() instead of its own scratch code. And further
replace deivce code which similarly releases device back to the system,
will have the fixes from the btrfs device delete.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>


# c1b7e474 14-Aug-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: rename super_kobj to fsid_kobj

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 499f377f 15-Jun-2015 Jeff Mahoney <jeffm@suse.com>

btrfs: iterate over unused chunk space in FITRIM

Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used. This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock. That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls. We release it
between searches to allow the file system to perform more-or-less
normally. Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search. This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 5a13f430 09-Mar-2015 Anand Jain <Anand.Jain@oracle.com>

Btrfs: sysfs: add pointer to access fs_info from fs_devices

adds fs_info pointer with struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# c73eccf7 09-Mar-2015 Anand Jain <anand.jain@oracle.com>

Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# 2e7910d6 09-Mar-2015 Anand Jain <Anand.Jain@oracle.com>

Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices

This patch will provide a framework and help to create attributes
from the structure btrfs_fs_devices which are available even before
fs_info is created. So by moving the parent kobject super_kobj from
fs_info to btrfs_fs_devices, it will help to create attributes
from the btrfs_fs_devices as well.

Patches on top of this patch now will be able to create the
sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
when devices are scanned and registered to the kernel.

Just to note, this does not change any of the existing btrfs sysfs
external kobject names and its attributes and not even the life
cycle of them. Changes are internal only. And to ensure the same,
this path has been tested with various device operations and,
checking and comparing the sysfs kobjects and attributes with
sysfs kobject and attributes with out this patch, and they remain
same.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# 326e1dbb 22-May-2015 Mike Snitzer <snitzer@redhat.com>

block: remove management of bi_remaining when restoring original bi_end_io

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 9eaed21e 01-Aug-2014 Eric Sandeen <sandeen@redhat.com>

btrfs: remove unused fs_info arg from btrfs_close_extra_devices()

The commit:
8dabb74 Btrfs: change core code of btrfs to support the
device replace operations
added the fs_info argument, but never used it -
just remove it again.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>


# 10f11900 20-Jan-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

Btrfs: Include map_type in raid_bio

Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
raid_map != NULL
or
raid_map[MAX_NR] == RAID[56]_Q_STRIPE

To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 6e9606d2 20-Jan-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

Btrfs: add ref_count and free function for btrfs_bio

1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
forced data type checking in compile.

Changelog v1->v2:
Rename following by David Sterba's suggestion:
put_btrfs_bio() -> btrfs_put_bio()
get_btrfs_bio() -> btrfs_get_bio()
bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 8e5cfb55 20-Jan-2015 Zhao Lei <zhaolei@cn.fujitsu.com>

Btrfs: Make raid_map array be inlined in btrfs_bio structure

It can make code more simple and clear, we need not care about
free bbio and raid_map together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 04216820 27-Nov-2014 Filipe Manana <fdmanana@suse.com>

Btrfs: fix race between fs trimming and block group remove/allocation

Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 2c8cdd6e 14-Nov-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs, replace: write dirty pages into the replace target device

The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
RAID56, we add the stripes of the replace target devices at the
end of the stripe array in btrfs bio, and we sort those target
device stripes in the array. And we keep the number of the target
device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
take the target device stripes into account.
- When we do write operation, we read the data from the common devices
and calculate the parity. Then write the dirty data and new parity
out, at this time, we will find the relative replace target stripes
and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# af8e2d1d 23-Oct-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>


# 084b6e7c 30-Oct-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Fix a lockdep warning when running xfstest.

The following lockdep warning is triggered during xfstests:

[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1702.986799] Possible interrupt unsafe locking scenario:

[ 1702.987681] CPU0 CPU1
[ 1702.988137] ---- ----
[ 1702.988598] lock(&fs_info->dev_replace.lock);
[ 1702.989069] local_irq_disable();
[ 1702.989534] lock(&delayed_node->mutex);
[ 1702.990038] lock(&found->groups_sem);
[ 1702.990494] <Interrupt>
[ 1702.990938] lock(&delayed_node->mutex);
[ 1702.991407]
*** DEADLOCK ***

It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.

To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.

Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 47ab2a6c 18-Sep-2014 Josef Bacik <jbacik@fb.com>

Btrfs: remove empty block groups automatically

One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space. This happens because all the chunks were allocated for
data since the metadata requirements were so low. But now there's a bunch of
empty data block groups and not enough metadata space to do anything. This
patch solves this problem by automatically deleting empty block groups. If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it. This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c1dc0896 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: do file data check by sub-bio's self

Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 67a2c45e 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix use-after-free problem of the device during device replace

The problem is:
Task0(device scan task) Task1(device replace task)
scan_one_device()
mutex_lock(&uuid_mutex)
device = find_device()
mutex_lock(&device_list_mutex)
lock_chunk()
rm_and_free_source_device
unlock_chunk()
mutex_unlock(&device_list_mutex)
check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7cc8e58d 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix unprotected device's variants on 32bits machine

->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# ce7213c7 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix wrong device bytes_used in the super block

device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 935e5cc9 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix wrong disk size when writing super blocks

total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 1c43366d 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix unprotected assignment of the target device

We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 90180da4 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cleanup unused num_can_discard in fs_devices

The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 443f24fe 23-Jul-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cleanup unused latest_devid and latest_trans in fs_devices

The member variants - latest_devid and latest_trans - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 6ba40b61 23-Jul-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: update the comment of total_bytes and disk_total_bytes of btrfs_devie

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# addc3fa7 23-Jul-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: Fix the problem that the dirty flag of dev stats is cleared

The io error might happen during writing out the device stats, and the
device stats information and dirty flag would be update at that time,
but the current code didn't consider this case, just clear the dirty
flag, it would cause that we forgot to write out the new device stats
information. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# d5ee37bc 23-Jul-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: make the device lock and its protected data in the same cacheline

The lock in btrfs_device structure was far away from its protected data, it would
make CPU load the cache line twice when we accessed them, move them together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>


# c55f1396 18-Jun-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix deadlock when mounting a degraded fs

The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
# mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
# echo 1 > /sys/block/`basename <dev0>`/device/delete
# mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>


# 7d824b6f 07-May-2014 David Sterba <dsterba@suse.cz>

btrfs: balance filter: add limit of processed chunks

This started as debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not finegrained enough and led to
converting too many chunks at once. Another example use is in connection
with drange+devid or vrange filters that allow to work with a specific
chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>


# d458b054 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Cleanup the "_struct" suffix in btrfs_workequeue

Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# a8c93d4e 27-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com>

btrfs: Replace fs_info->submit_workers with btrfs_workqueue.

Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# c404e0dc 30-Jan-2014 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: fix use-after-free in the finishing procedure of the device replace

During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.

For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.

For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.

Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>


# 3c45bfc1 30-Oct-2013 Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>

btrfs: Pack struct btrfs_device

Pack the structure btrfs_device in volumes.h to eliminate holes detected
by pahole, thus reducing binary memory footprint.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 9b011adf 25-Oct-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Btrfs: remove scrub_super_lock holding in btrfs_sync_log()

Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 12bd2fc0 23-Aug-2013 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add btrfs_alloc_device and switch to it

Currently btrfs_device is allocated ad-hoc in a few different places,
and as a result not all fields are initialized properly. In particular,
readahead state is only initialized in device_list_add (at scan time),
and not in btrfs_init_new_device (when the new device is added with
'btrfs dev add'). Fix this by adding an allocation helper and switch
everybody but __btrfs_close_devices to it. (__btrfs_close_devices is
dealt with in a later commit.)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 70f80175 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: check UUID tree during mount if required

If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache is doing it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, delete those
entries that are not valid anymore (i.e., the subvol does not
exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add
the UUID tree entries for the subvolume (if they are not
already there).

This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# f7a81ea4 15-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: create UUID tree if required

This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# facc8a22 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: don't cache the csum value into the extent state tree

Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 6df9a95e 27-Jun-2013 Josef Bacik <jbacik@fusionio.com>

Btrfs: make the chunk allocator completely tree lockless

When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock. To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff. We still keep the in-memory
stuff which makes sure everything is consistent.

One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice. This has the side
effect of fixing a bug with balance that has been there ever since balance
existed. Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction. So if you happen to crash
during a balance you could come back to a completely broken file system. This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits. This has passed all of the xfstests and my super annoying stress test
followed by a balance. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# cb517eab 15-May-2013 Miao Xie <miaox@cn.fujitsu.com>

Btrfs: cleanup the similar code of the fs root read

There are several functions whose code is similar, such as
btrfs_find_last_root()
btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 9be3395b 17-May-2013 Chris Mason <chris.mason@fusionio.com>

Btrfs: use a btrfs bioset instead of abusing bio internals

Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>


# 48a3b636 25-Apr-2013 Eric Sandeen <sandeen@redhat.com>

btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 55e301fd 28-Jan-2013 Filipe Brandenburger <filbranden@google.com>

Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h

The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# 53b381b3 29-Jan-2013 David Woodhouse <David.Woodhouse@intel.com>

Btrfs: RAID5 and RAID6

This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 31e50229 21-Nov-2012 Liu Bo <bo.li.liu@oracle.com>

Btrfs: put raid properties into global table

Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 8dabb742 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: change core code of btrfs to support the device replace operations

This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# e93c89c1 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add new sources for device replace code

This adds a new file to the sources together with the header file
and the changes to ioctl.h and ctree.h that are required by the
new C source file. Additionally, 4 new functions are added to
volume.c that deal with device creation and destruction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 63a212ab 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: disallow some operations on the device replace target device

This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# aa1b8cd4 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: pass fs_info instead of root

A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# a8a6dab7 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add btrfs_scratch_superblock() function

This new function is used by the device replace procedure in
a later patch.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 3ec706c8 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree

This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5d964051 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree

This is required for the device replace procedure in a later step.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 7ba15b7d 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add two more find_device() methods

The new function btrfs_find_device_missing_or_by_path() will be
used for the device replace procedure. This function itself calls
the second new function btrfs_find_device_by_path().
Unfortunately, it is not possible to currently make the rest of the
code use these functions as well, since all functions that look
similar at first view are all a little bit different in what they
are doing. But in the future, new code could benefit from these
two new functions, and currently, device replace uses them.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# d9d181c1 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: rename the scrub context structure

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.

Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>


# 5ee0844d 27-Aug-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: revert checksum error statistic which can cause a BUG()

Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.

Cc: stable <stable@vger.kernel.org> # 3.5
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 02db0844 21-Jun-2012 Josef Bacik <jbacik@fusionio.com>

Btrfs: add DEVICE_READY ioctl

This will be used in conjunction with btrfs device ready <dev>. This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently. This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>


# b27f7c0c 22-Jun-2012 David Sterba <dsterba@suse.cz>

btrfs: join DEV_STATS ioctls to one

Commit c11d2c236cc260b36 (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>


# 2b6ba629 22-Jun-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: resume balance on rw (re)mounts properly

This introduces btrfs_resume_balance_async(), which, given that
restriper state was recovered earlier by btrfs_recover_balance(),
resumes balance in btrfs-balance kthread.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 68310a5e 22-Jun-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: restore restriper state on all mounts

Fix a bug that triggered asserts in btrfs_balance() in both normal and
resume modes -- restriper state was not properly restored on read-only
mounts. This factors out resuming code from btrfs_restore_balance(),
which is now also called earlier in the mount sequence to avoid the
problem of some early writes getting the old profile.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 606686ee 04-Jun-2012 Josef Bacik <josef@redhat.com>

Btrfs: use rcu to protect device->name

Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>


# 733f4fbb 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: read device stats on mount, write modified ones during commit

The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# c11d2c23 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add ioctl to get and reset the device stats

An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 442a4f63 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de>

Btrfs: add device counters for detected IO and checksum errors

The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>


# 143bede5 01-Mar-2012 Jeff Mahoney <jeffm@suse.com>

btrfs: return void in functions without error conditions

Signed-off-by: Jeff Mahoney <jeffm@suse.com>


# 19a39dce 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add balance progress reporting

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a7e99c69 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: allow for canceling restriper

Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 837d5b6e 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: allow for pausing restriper

Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 59641015 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: recover balance on mount

On mount, if balance item is found, resume balance in a separate
kernel thread.

Try to be smart to continue roughly where previous balance (or convert)
was interrupted. For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cfa4c961 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: soft profile changing mode (aka soft convert)

When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to. This is useful if e.g. half of the FS was converted
earlier.

The soft mode switch is (like every other filter) per-type. This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e4d8ec0f 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: implement online profile changing

Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized. Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce. If target profile is not yet available it goes
back to a plain reduce.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ea67176a 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: virtual address space subset filter

Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 94e60d5a 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: devid subset filter

Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.

This filter only works when devid filter is turned on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 409d404b 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: devid filter

Relocate chunks which have at least one stripe located on a device with
devid X.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5ce5b3c0 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: usage filter

Select chunks that are less than X percent full.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ed25e9b2 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: profiles filter

Select chunks based on a given profile mask.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f43ffb60 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add basic infrastructure for selective balancing

This allows to have a separate set of filters for each chunk type
(data,meta,sys). The code however is generic and switch on chunk type
is only done once.

This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c9e9f97b 16-Jan-2012 Ilya Dryomov <idryomov@gmail.com>

Btrfs: add basic restriper infrastructure

Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 125ccb0a 08-Dec-2011 Li Zefan <lizf@cn.fujitsu.com>

Btrfs: don't pass a trans handle unnecessarily in volumes.c

Some functions never use the transaction handle passed to them.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>


# 387125fc 18-Nov-2011 Chris Mason <chris.mason@oracle.com>

Btrfs: fix barrier flushes

When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.

But, we have two bugs in the way these are sent down. When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.

In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.

Both of these bugs can cause corruptions on power failures. We fix it
with some new code to send down empty barriers to all devices before
writing the first super.

Alexandre Oliva found the multi-device bug. Arne Jansen did the async
barrier loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>


# 90519d66 23-May-2011 Arne Jansen <sensille@gmx.net>

btrfs: state information for readahead

Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>


# a1d3c478 04-Aug-2011 Jan Schmidt <list.btrfs@jan-o-sch.net>

btrfs: btrfs_multi_bio replaced with btrfs_bio

btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>


# d5e2003c 04-Aug-2011 Josef Bacik <josef@redhat.com>

Btrfs: detect wether a device supports discard

We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay. So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it. And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1f78160c 20-Apr-2011 Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>

Btrfs: using rcu lock in the reader side of devices list

fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 73c5de00 11-Apr-2011 Arne Jansen <sensille@gmx.net>

btrfs: quasi-round-robin for chunk allocation

In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.


# bcd53741 12-Apr-2011 Arne Jansen <sensille@gmx.net>

btrfs: move btrfs_cmp_device_free_bytes to super.c

this function won't be used here anymore, so move it super.c where it is
used for df-calculation


# a2de733c 08-Mar-2011 Arne Jansen <sensille@gmx.net>

btrfs: scrub

This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>


# f2a97a9d 04-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove all unused functions

Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>


# 621496f4 03-May-2011 David Sterba <dsterba@suse.cz>

btrfs: remove unused function prototypes

function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>


# fce3bb9a 24-Mar-2011 Li Dongyang <lidongyang@novell.com>

Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP

btrfs_map_block() will only return a single stripe length, but we want the
full extent be mapped to each disk when we are trimming the extent,
so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 1abe9b8a 24-Mar-2011 liubo <liubo2009@cn.fujitsu.com>

Btrfs: add initial tracepoint support for btrfs

Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
btrfs_sync_file
btrfs_sync_fs

These show sync arguments.

6) transaction:
btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6d07bcec 05-Jan-2011 Miao Xie <miaox@cn.fujitsu.com>

btrfs: fix wrong free space information of btrfs

When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.

# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10

It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.

This patch fixes it by calcing the free space that can be used to allocate
chunks.

Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1

This implementation is just likely fake chunk allocation.

After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b2117a39 05-Jan-2011 Miao Xie <miaox@cn.fujitsu.com>

btrfs: make the chunk allocator utilize the devices better

With this patch, we change the handling method when we can not get enough free
extents with default size.

Implementation:
1. Look up the suitable free extent on each device and keep the search result.
If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
succeeds.
3. If we can not get enough free extents, but the number of the extent with
default size is >= min_stripes, we just change the mapping information
(reduce the number of stripes in the extent map), and chunk allocation
succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
stripe size to allocate the device space

By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cd02dca5 13-Dec-2010 Chris Mason <chris.mason@oracle.com>

Btrfs: account for missing devices in RAID allocation profiles

When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d4d77629 13-Nov-2010 Tejun Heo <tj@kernel.org>

block: clean up blkdev_get() wrappers and their users

After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.

* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>


# c3b9a62c 18-Aug-2010 Christoph Hellwig <hch@infradead.org>

btrfs: replace barriers with explicit flush / FUA usage

Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# ba1bf481 11-Sep-2009 Josef Bacik <josef@redhat.com>

Btrfs: make balance code choose more wisely when relocating

Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.

V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.

-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.

-check to make sure the block group we are going to relocate isn't the last one
in that particular space

-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e5e9a520 10-Jun-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: avoid races between super writeout and device list updates

On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete. There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened with no transation writers running.

With the btrfs fsync log and other async transaction updates, this
has been racey for some time. This adds a mutex to protect
the device list. The existing volume mutex could not be reused due to
transaction lock ordering requirements.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# c289811c 10-Jun-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: autodetect SSD devices

During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS. If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d6397bae 27-Apr-2009 Chris Ball <cjb@laptop.org>

Btrfs: When shrinking, only update disk size on success

Previously, we updated a device's size prior to attempting a shrink
operation. This patch moves the device resizing logic to only happen if
the shrink completes successfully. In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ffbd517d 20-Apr-2009 Chris Mason <chris.mason@oracle.com>

Btrfs: use WRITE_SYNC for synchronous writes

Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future. This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.

Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios. The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.

This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# d4a78947 02-Apr-2009 Wu Fengguang <fengguang.wu@intel.com>

Btrfs: fix typos in comments

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e4404d6e 12-Dec-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: shared seed device

This patch makes seed device possible to be shared by
multiple mounted file systems. The sharing is achieved
by cloning seed device's btrfs_fs_devices structure.
Thanks you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# a512bbf8 08-Dec-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: superblock duplication

This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 97288f2c 02-Dec-2008 Christoph Hellwig <hch@lst.de>

Btrfs: corret fmode_t annotations

Make sure to propagate fmode_t properly and use the right constants for
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>


# 15916de8 19-Nov-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fixes for 2.6.28-rc API changes

* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 2b82032c 17-Nov-2008 Yan Zheng <zheng.yan@oracle.com>

Btrfs: Seed device support

Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>


# 7d2b4daa 05-Aug-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Fix the multi-bio code to save the original bio for completion

The multi-bio code is responsible for duplicating blocks in raid1 and
single spindle duplication. It has counters to make sure all of
the locations for a given extent are properly written before io completion
is returned to the higher layers.

But, it didn't always complete the same bio it was given, sometimes a
clone was completed instead. This lead to problems with the async
work queues because they saved a pointer to the bio in a struct off
bi_private.

The fix is to remember the original bio and only complete that one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8b712842 11-Jun-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add async worker threads for pre and post IO checksumming

Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system. But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up. The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines. Two worker pools are created,
one for writes and one for endio processing. The two could deadlock if
we tried to service both from a single pool.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a0af469b 13-May-2008 Chris Mason <chris.mason@oracle.com>

Fix btrfs_open_devices to deal with changes since the scan ioctls

Devices can change after the scan ioctls are done, and btrfs_open_devices
needs to be able to verify them as they are opened and used by the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# dfe25020 13-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add mount -o degraded to allow mounts to continue with missing devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a061fc8d 07-May-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for online device removal

This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev. This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface. Things got ugly keeping the mapping constant.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# a236aed1 29-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Deal with failed writes in mirrored configurations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# ec44a35c 28-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add balance ioctl to restripe the chunks

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 788f20eb 28-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add new ioctl to add devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8f18cf13 25-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Make the resizer work based on shrinking and growing devices

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# b3075717 22-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add a special device list for chunk allocations

This allows other code that needs to walk every device in the FS to do so
without locking against allocations.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f2d8d74d 21-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Make an unplug function that doesn't unplug every spindle

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# e17cade2 15-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add chunk uuids and update multi-device back references

Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f2984462 10-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Write out all super blocks on commit, and bring back proper barrier support

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# f188591e 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Retry metadata reads in the face of checksum failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# cea9e445 09-Apr-2008 Chris Mason <chris.mason@oracle.com>

Change btrfs_map_block to return a structure with mappings for all stripes

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8790d502 03-Apr-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for mirroring across drives

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 8a4b83cc 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for device scanning and detection ioctls

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 239b14b3 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Bring back mount -o ssd optimizations

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0d81ba5d 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Move device information into the super block so it can be scanned

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 6324fbf3 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Dynamic chunk and block group allocation

Signed-off-by: Chris Mason <chris.mason@oracle.com>


# 0b86a832 24-Mar-2008 Chris Mason <chris.mason@oracle.com>

Btrfs: Add support for multiple devices per filesystem

Signed-off-by: Chris Mason <chris.mason@oracle.com>