#
2f1aeab9 |
|
18-Mar-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: return accurate error code on open failure in open_fs_devices() When attempting to exclusive open a device which has no exclusive open permission, such as a physical device associated with the flakey dm device, the open operation will fail, resulting in a mount failure. In this particular scenario, we erroneously return -EINVAL instead of the correct error code provided by the bdev_open_by_path() function, which is -EBUSY. Fix this, by returning error code from the bdev_open_by_path() function. With this correction, the mount error message will align with that of ext4 and xfs. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f7eb840 |
|
29-Feb-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: validate device maj:min during open Boris managed to create a device capable of changing its maj:min without altering its device path. Only multi-devices can be scanned. A device that gets scanned and remains in the btrfs kernel cache might end up with an incorrect maj:min. Despite the temp-fsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature. To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan. CC: stable@vger.kernel.org # 6.7+ Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability") Reported-by: Boris Burkov <boris@bur.io> Co-developed-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io># Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d565fffa |
|
12-Feb-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: do not skip re-registration for the mounted device There are reports that since version 6.7 update-grub fails to find the device of the root on systems without initrd and on a single device. This looks like the device name changed in the output of /proc/self/mountinfo: 6.5-rc5 working 18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ... 6.7 not working: 17 1 0:15 / / rw,noatime - btrfs /dev/root ... and "update-grub" shows this error: /usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?) This looks like it's related to the device name, but grub-probe recognizes the "/dev/root" path and tries to find the underlying device. However there's a special case for some filesystems, for btrfs in particular. The generic root device detection heuristic is not done and it all relies on reading the device infos by a btrfs specific ioctl. This ioctl returns the device name as it was saved at the time of device scan (in this case it's /dev/root). The change in 6.7 for temp_fsid to allow several single device filesystem to exist with the same fsid (and transparently generate a new UUID at mount time) was to skip caching/registering such devices. This also skipped mounted device. One step of scanning is to check if the device name hasn't changed, and if yes then update the cached value. This broke the grub-probe as it always read the device /dev/root and couldn't find it in the system. A temporary workaround is to create a symlink but this does not survive reboot. The right fix is to allow updating the device path of a mounted filesystem even if this is a single device one. In the fix, check if the device's major:minor number matches with the cached device. If they do, then we can allow the scan to happen so that device_list_add() can take care of updating the device path. The file descriptor remains unchanged. This does not affect the temp_fsid feature, the UUID of the mounted filesystem remains the same and the matching is based on device major:minor which is unique per mounted filesystem. This covers the path when the device (that exists for all mounted devices) name changes, updating /dev/root to /dev/sdx. Any other single device with filesystem and is not mounted is still skipped. Note that if a system is booted and initial mount is done on the /dev/root device, this will be the cached name of the device. Only after the command "btrfs device scan" it will change as it triggers the rename. The fix was verified by users whose systems were affected. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353 Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/ Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem") CC: stable@vger.kernel.org # 6.7+ Tested-by: Alex Romosan <aromosan@gmail.com> Tested-by: CHECK_1234543212345@protonmail.com Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ae6bd7f9 |
|
29-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix off-by-one chunk length calculation at contains_pending_extent() At contains_pending_extent() the value of the end offset of a chunk we found in the device's allocation state io tree is inclusive, so when we calculate the length we pass to the in_range() macro, we must sum 1 to the expression "physical_end - physical_offset". In practice the wrong calculation should be harmless as chunks sizes are never 1 byte and we should never have 1 byte ranges of unallocated space. Nevertheless fix the wrong calculation. Reported-by: Alex Lyakas <alex.lyakas@zadara.com> Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/ Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0782303a |
|
24-Feb-2024 |
Anand Jain <anand.jain@oracle.com> |
btrfs: include device major and minor numbers in the device scan notice To better debug issues surrounding device scans, include the device's major and minor numbers in the device scan notice for btrfs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cdeac6d |
|
22-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_device to btrfs_scratch_superblocks() Replace the two parameters bdev and name by one that can be used to get them both. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e6052347 |
|
16-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: move balance args conversion helpers to volumes.c The from/to CPU/disk helpers for balance args are used only in volumes, no need to define them in accessors.h. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
53e4d8c2 |
|
24-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: change BUG_ON to assertion in reset_balance_state() The balance state machine is complex so it's good to verify the assumptions in helpers, however reset_balance_state() is used at the end of balance and fs_info::balance_ctl is properly set up before and protected by the exclusive op ownership in btrfs_balance(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7411055d |
|
23-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks() The unhandled case in btrfs_relocate_sys_chunks() loop is a corruption, as it could be caused only by two impossible conditions: - at first the search key is set up to look for a chunk tree item, with offset -1, this is an inexact search and the key->offset will contain the correct offset upon a successful search, a valid chunk tree item cannot have an offset -1 - after first successful search, the found_key corresponds to a chunk item, the offset is decremented by 1 before the next loop, it's impossible to find a chunk item there due to alignment and size constraints Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4dc4a3be |
|
01-Feb-2024 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: use READ/WRITE_ONCE for fs_devices->read_policy Since we can read/modify the value from the sysfs interface concurrently, it would be better to protect it from compiler optimizations. Currently, there is only one read policy BTRFS_READ_POLICY_PID available, so no actual problem can happen now. This is a preparation for the future expansion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b712e3b |
|
25-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused included headers With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ae061cf |
|
23-Jan-2024 |
Christian Brauner <brauner@kernel.org> |
btrfs: port device access to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-19-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
d967c914 |
|
21-Dec-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: fix unbalanced unlock of mapping_tree_lock The error path of btrfs_get_chunk_map() releases fs_info->mapping_tree_lock. But, it is taken and released in btrfs_find_chunk_map(). So, there is no need to do so. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e94dfb7a |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: pass btrfs_io_geometry into btrfs_max_io_len Instead of passing three individual members of 'struct btrfs_io_geometry' into btrfs_max_io_len(), pass a pointer to btrfs_io_geometry. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6edf6822 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: pass struct btrfs_io_geometry to set_io_stripe Instead of passing three members of 'struct btrfs_io_geometry' into set_io_stripe() pass a pointer to the whole structure and then get the needed members out of btrfs_io_geometry. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
89f547c6 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: open code set_io_stripe for RAID56 Open code set_io_stripe() for RAID56, as it a) uses a different method to calculate the stripe_index b) doesn't need to go through raid-stripe-tree mapping code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b55b3077 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: change block mapping to switch/case in btrfs_map_block Now that all the per-profile if/else statement blocks have been converted to calls to helper the conversion to switch/case is straightforward. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a16fb8c6 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out block mapping for single profiles Now that we have a container for the I/O geometry that has all the needed information for the block mappings of SINGLE profiles, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
089221d3 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out block mapping for RAID5/6 Now that we have a container for the I/O geometry that has all the needed information for the block mappings of RAID5 and RAID6, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d9d4ce9f |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: reduce scope of data_stripes in btrfs_map_block Reduce the scope of 'data_stripes' in btrfs_map_block(). While the change alone may not make too much sense, it helps us factoring out a helper function for the block mapping of RAID56 I/O. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8938f112 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out block mapping for RAID10 Now that we have a container for the I/O geometry that has all the needed information for the block mappings of RAID10, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5aeb15c8 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out block mapping for DUP profiles Now that we have a container for the I/O geometry that has all the needed information for the block mappings of DUP, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e36aba8 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out RAID1 block mapping Now that we have a container for the I/O geometry that has all the needed information for the block mappings of RAID1, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
30e8534b |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out block-mapping for RAID0 Now that we have a container for the I/O geometry that has all the needed information for the block mappings of RAID0, factor out a helper calculating this information. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd747f2d |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: re-introduce struct btrfs_io_geometry Re-introduce struct btrfs_io_geometry, holding the necessary bits and pieces needed in btrfs_map_block() to decide the I/O geometry of a specific block mapping. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02d05b64 |
|
13-Dec-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: factor out helper for single device IO check The check in btrfs_map_block() deciding if a particular I/O is targeting a single device is getting more and more convoluted. Factor out the check conditions into a helper function, with no functional change otherwise. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7dc66abb |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use a dedicated data structure for chunk maps Currently we abuse the extent_map structure for two purposes: 1) To actually represent extents for inodes; 2) To represent chunk mappings. This is odd and has several disadvantages: 1) To create a chunk map, we need to do two memory allocations: one for an extent_map structure and another one for a map_lookup structure, so more potential for an allocation failure and more complicated code to manage and link two structures; 2) For a chunk map we actually only use 3 fields (24 bytes) of the respective extent map structure: the 'start' field to have the logical start address of the chunk, the 'len' field to have the chunk's size, and the 'orig_block_len' field to contain the chunk's stripe size. Besides wasting a memory, it's also odd and not intuitive at all to have the stripe size in a field named 'orig_block_len'. We are also using 'block_len' of the extent_map structure to contain the chunk size, so we have 2 fields for the same value, 'len' and 'block_len', which is pointless; 3) When an extent map is associated to a chunk mapping, we set the bit EXTENT_FLAG_FS_MAPPING on its flags and then make its member named 'map_lookup' point to the associated map_lookup structure. This means that for an extent map associated to an inode extent, we are not using this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform); 4) Extent maps associated to a chunk mapping are never merged or split so it's pointless to use the existing extent map infrastructure. So add a dedicated data structure named 'btrfs_chunk_map' to represent chunk mappings, this is basically the existing map_lookup structure with some extra fields: 1) 'start' to contain the chunk logical address; 2) 'chunk_len' to contain the chunk's length; 3) 'stripe_size' for the stripe size; 4) 'rb_node' for insertion into a rb tree; 5) 'refs' for reference counting. This way we do a single memory allocation for chunk mappings and we don't waste memory for them with unused/unnecessary fields from an extent_map. We also save 8 bytes from the extent_map structure by removing the 'map_lookup' pointer, so the size of struct extent_map is reduced from 144 bytes down to 136 bytes, and we can now have 30 extents map per 4K page instead of 28. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5031660a |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: mark sanity checks when getting chunk map as unlikely When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity checks to verify that we found an extent map and that it includes the requested logical address. These are never expected to fail, so mark them as unlikely to make it more clear as well as to allow a compiler to generate more efficient code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7d410d5e |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make error messages more clear when getting a chunk map When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity checks to verify we found a chunk map and that map found covers the logical address the caller passed in. However the messages aren't very clear in the sense that don't mention the issue is with a chunk map and one of them prints the 'length' argument as if it were the end offset of the requested range (while the in the string format we use %llu-%llu which suggests a range, and the second %llu-%llu is actually a range for the chunk map). So improve these two details in the error messages. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5fba5a57 |
|
21-Nov-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix off-by-one when checking chunk map includes logical address At btrfs_get_chunk_map() we get the extent map for the chunk that contains the given logical address stored in the 'logical' argument. Then we do sanity checks to verify the extent map contains the logical address. One of these checks verifies if the extent map covers a range with an end offset behind the target logical address - however this check has an off-by-one error since it will consider an extent map whose start offset plus its length matches the target logical address as inclusive, while the fact is that the last byte it covers is behind the target logical address (by 1). So fix this condition by using '<=' rather than '<' when comparing the extent map's "start + length" against the target logical address. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cd63ffbd |
|
30-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix error pointer dereference after failure to allocate fs devices At device_list_add() we allocate a btrfs_fs_devices structure and then before checking if the allocation failed (pointer is ERR_PTR(-ENOMEM)), we dereference the error pointer in a memcpy() argument if the feature BTRFS_FEATURE_INCOMPAT_METADATA_UUID is enabled. Fix this by checking for an allocation error before trying the memcpy(). Fixes: f7361d8c3fc3 ("btrfs: sipmlify uuid parameters of alloc_fs_devices()") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86ec15d0 |
|
27-Sep-2023 |
Jan Kara <jack@suse.cz> |
btrfs: Convert to bdev_open_by_path() Convert btrfs to use bdev_open_by_path() and pass the handle around. We also drop the holder from struct btrfs_device as it is now not needed anymore. CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
c47b02c1 |
|
04-Oct-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: disable the seed feature for temp-fsid A seed device is an integral component of the sprout device, which functions as a multi-device filesystem. Therefore, temp-fsid feature is not supported. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a5b8a5f9 |
|
27-Sep-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: support cloned-device mount capability Guilherme's previous work [1] aimed at the mounting of cloned devices using a superblock flag SINGLE_DEV during mkfs. [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/ Building upon this work, here is in memory only approach. As it mounts we determine if the same fsid is already mounted if then we generate a random temp fsid which shall be used the mount, in memory only not written to the disk. We distinguish devices by devt. Example: $ fallocate -l 300m ./disk1.img $ mkfs.btrfs -f ./disk1.img $ cp ./disk1.img ./disk2.img $ cp ./disk1.img ./disk3.img $ mount -o loop ./disk1.img /btrfs $ mount -o ./disk2.img /btrfs1 $ mount -o ./disk3.img /btrfs2 $ btrfs fi show -m Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop0 Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop1 Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop2 Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69d427f3 |
|
27-Sep-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add helper function find_fsid_by_disk In preparation for adding support to mount multiple single-disk btrfs filesystems with the same FSID, wrap find_fsid() into find_fsid_by_disk(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f2d3c01 |
|
27-Sep-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: increase ->free_chunk_space in btrfs_grow_device My overcommit patch exposed a bug with btrfs/177 [1]. The problem here is that when we grow the device we're not adding to ->free_chunk_space, so subsequent allocations can cause ->free_chunk_space to wrap, which causes problems in can_overcommit because we add this to ->total_bytes, which causes the counter to wrap and gives us an unexpected ENOSPC. Fix this by properly updating ->free_chunk_space with the new available space in btrfs_grow_device. [1] First version of the fix: https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9fd2c05 |
|
27-Sep-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix ->free_chunk_space math in btrfs_shrink_device There are two bugs in how we adjust ->free_chunk_space in btrfs_shrink_device. First we're removing the entire diff between new_size and old_size from ->free_chunk_space. This only works if we're reducing the free area, which we could potentially not be. So adjust the math to only subtract the diff in the free space from ->free_chunk_space. Additionally in the error case we're unconditionally adding the diff back into ->free_chunk_space, which we need to only do if this device is writeable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5966930d |
|
20-Sep-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove incomplete metadata_uuid conversion fixup logic Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has stopped the assembly of devices with the CHANGING_FSID_V2 flag in the kernel. Such devices can be scanned but will not be registered and can't be mounted without a manual fix by btrfstune. Remove the related logic and now unused code. The original motivation was to allow an interrupted partial conversion fix itself on next mount, in case the system has to be rebooted. This is a convenience but brings a lot of complexity the device scanning and handling the partial states. It's hard to estimate if this was ever needed in practice, expecting the typical use case like a manual conversion of an unmounted filesystem where the user can verify the success and rerun it eventually. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add historical context ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
197a9ece |
|
20-Sep-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: reject devices with CHANGING_FSID_V2 The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state where the device in the userspace btrfstune -m|-M operation failed to complete changing the fsid. This flag makes the kernel to automatically determine the other partner devices to which a given device can be associated, based on the fsid, metadata_uuid and generation values. btrfstune -m|M feature is especially useful in virtual cloud setups, where compute instances (disk images) are quickly copied, fsid changed, and launched. Given numerous disk images with the same metadata_uuid but different fsid, there's no clear way a device can be correctly assembled with the proper partners when the CHANGING_FSID_V2 flag is set. So, the disk could be assembled incorrectly, as in the example below: Before this patch: Consider the following two filesystems: /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrsftune -m operation fails. In this scenario, as the /dev/loop0's fsid change is interrupted, and the CHANGING_FSID_V2 flag is set as shown below. $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags" $ btrfs inspect dump-super /dev/loop0 | egrep '$p' superblock: bytenr=65536, device=/dev/loop0 flags 0x1000000001 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 9 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop1 | egrep '$p' superblock: bytenr=65536, device=/dev/loop1 flags 0x1 fsid 11d2af4d-1b71-45a9-83f6-f2100766939d metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 10 num_devices 2 incompat_flags 0x741 dev_item.devid 2 $ btrfs inspect dump-super /dev/loop2 | egrep '$p' superblock: bytenr=65536, device=/dev/loop2 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 1 $ btrfs inspect dump-super /dev/loop3 | egrep '$p' superblock: bytenr=65536, device=/dev/loop3 flags 0x1 fsid 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 metadata_uuid bb040a9f-233a-4de2-ad84-49aa5a28059b generation 8 num_devices 2 incompat_flags 0x741 dev_item.devid 2 It is normal that some devices aren't instantly discovered during system boot or iSCSI discovery. The controlled scan below demonstrates this. $ btrfs device scan --forget $ btrfs device scan /dev/loop0 Scanning for btrfs filesystems on '/dev/loop0' $ mount /dev/loop3 /btrfs $ btrfs filesystem show -m Label: none uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45 Total devices 2 FS bytes used 144.00KiB devid 1 size 300.00MiB used 48.00MiB path /dev/loop0 devid 2 size 300.00MiB used 40.00MiB path /dev/loop3 /dev/loop0 and /dev/loop3 are incorrectly partnered. This kernel patch removes functions and code connected to the CHANGING_FSID_V2 flag. With this patch, now devices with the CHANGING_FSID_V2 flag are rejected. And its partner will fail to mount with the extra -o degraded option. The check is removed from open_ctree(), devices are rejected during scanning which in turn fails the mount. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
568220fa |
|
14-Sep-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: support RAID0/1/10 on top of raid stripe tree When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for data block groups. For metadata block groups, we don't actually need anything special, as all metadata I/O is protected by the btrfs_zoned_meta_io_lock() already. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
10e27980 |
|
14-Sep-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: lookup physical address from stripe extent Lookup the physical address from the raid stripe tree when a read on an RAID volume formatted with the raid stripe tree was attempted. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02c372e1 |
|
14-Sep-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: add support for inserting raid stripe extents Add support for inserting stripe extents into the raid stripe tree on completion of every write that needs an extra logical-to-physical translation when using RAID. Inserting the stripe extents happens after the data I/O has completed, this is done to a) support zone-append and b) rule out the possibility of a RAID-write-hole. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50564b65 |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc27d6f0 |
|
08-Sep-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: scan but don't register device on single device filesystem After the commit 5f58d783fd78 ("btrfs: free device in btrfs_close_devices for a single device filesystem") we unregister the device from the kernel memory upon unmounting for a single device. So, device registration that was performed before mounting if any is no longer in the kernel memory. However, in fact, note that device registration is unnecessary for a single-device btrfs filesystem unless it's a seed device. So for commands like 'btrfs device scan' or 'btrfs device ready' with a non-seed single-device btrfs filesystem, they can return success just after superblock verification and without the actual device scan. When 'device scan --forget' is called on such device no error is returned. The seed device must remain in the kernel memory to allow the sprout device to mount without the need to specify the seed device explicitly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9fb2acc2 |
|
17-Sep-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the need_raid_map parameter from btrfs_map_block() The parameter @need_raid_map is mostly a legacy from the old days where we don't yet have a solid definition on the @mirror_num, and only check-integrity was using that parameter, while all other call sites just pass 1 for that parameter. Now since we have removed check-integrity functionality, we can also remove the @need_raid_map parameter. This change will also remove the ability to read P/Q stripe directly when passing 0 as @need_raid_map. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9580503b |
|
07-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: reformat remaining kdoc style comments Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7361d8c |
|
23-Aug-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: sipmlify uuid parameters of alloc_fs_devices() Among all the callers, only the device_list_add() function uses the second argument of alloc_fs_devices(). It passes metadata_uuid when available, otherwise, it passes NULL. And in turn, alloc_fs_devices() is designed to copy either metadata_uuid or fsid into fs_devices::metadata_uuid. So remove the second argument in alloc_fs_devices(), and always copy the fsid. In the caller device_list_add() function, we will overwrite it with metadata_uuid when it is available. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8a540e99 |
|
06-Oct-2023 |
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> |
btrfs: fix stripe length calculation for non-zoned data chunk allocation Commit f6fca3917b4d "btrfs: store chunk size in space-info struct" broke data chunk allocations on non-zoned multi-device filesystems when using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk size and stripe length calculation" partially fixed that, and this patch completes the fix for that case. After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for a data chunk allocation on a non-zoned filesystem is: 1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was 1 GiB for non-zoned data chunks and not configurable. 2. btrfs_create_chunk calls gather_device_info which consumes and produces more fields of chunk_ctl. 3. gather_device_info multiplies ctl->max_stripe_len by ctl->dev_stripes (which is 1 in all cases except dup) and calls find_free_dev_extent with that number as num_bytes. 4. find_free_dev_extent locates the first dev_extent hole on a device which is at least as large as num_bytes. With default max_chunk_size from f6fca3917b4d, it finds the first hole which is longer than 10 GiB, or the largest hole if that hole is shorter than 10 GiB. This is different from the pre-f6fca3917b4d behavior, where num_bytes is 1 GiB, and find_free_dev_extent may choose a different hole. 5. gather_device_info repeats step 4 with all devices to find the first or largest dev_extent hole that can be allocated on each device. 6. gather_device_info sorts the device list by the hole size on each device, using total unallocated space on each device to break ties, then returns to btrfs_create_chunk with the list. 7. btrfs_create_chunk calls decide_stripe_size_regular. 8. decide_stripe_size_regular finds the largest stripe_len that fits across the first nr_devs device dev_extent holes that were found by gather_device_info (and satisfies other constraints on stripe_len that are not relevant here). 9. decide_stripe_size_regular caps the length of the stripe it computed at 1 GiB. This cap appeared in 5da431b71d4b to correct one of the other regressions introduced in f6fca3917b4d. 10. btrfs_create_chunk creates a new chunk with the above computed size and number of devices. At step 4, gather_device_info() has found a location where stripe up to 10 GiB in length could be allocated on several devices, and selected which devices should have a dev_extent allocated on them, but at step 9, only 1 GiB of the space that was found on each device can be used. This mismatch causes new suboptimal chunk allocation cases that did not occur in pre-f6fca3917b4d kernels. Consider a filesystem using raid1 profile with 3 devices. After some balances, device 1 has 10x 1 GiB unallocated space, while devices 2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of space, but distributed across different numbers of dev_extent holes. For visualization, let's ignore all the chunks that were allocated before this point, and focus on the remaining holes: Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated) Device 2: [__________] (10 GiB contig unallocated) Device 3: [__________] (10 GiB contig unallocated) Before f6fca3917b4d, the allocator would fill these optimally by allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3 ([13]), or 2 and 3 ([23]): [after 0 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [__________] (10 GiB) Device 3: [__________] (10 GiB) [after 1 chunk allocation] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] Device 2: [12] [_________] (9 GiB) Device 3: [__________] (10 GiB) [after 2 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [12] [_________] (9 GiB) Device 3: [13] [_________] (9 GiB) [after 3 chunk allocations] Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB) Device 2: [12] [12] [________] (8 GiB) Device 3: [13] [_________] (9 GiB) [...] [after 12 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 13 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 14 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB) [after 15 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full) This allocates all of the space with no waste. The sorting function used by gather_device_info considers free space holes above 1 GiB in length to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently long hole on each device, all the holes appear equal in the sort, and the comparison falls back to sorting devices by total free space. This keeps usable space on each device equal so they can all be filled completely. After f6fca3917b4d, the allocator prefers the devices with larger holes over the devices with more free space, so it makes bad allocation choices: [after 1 chunk allocation] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [_________] (9 GiB) Device 3: [23] [_________] (9 GiB) [after 2 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [________] (8 GiB) Device 3: [23] [23] [________] (8 GiB) [after 3 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [_______] (7 GiB) Device 3: [23] [23] [23] [_______] (7 GiB) [...] [after 9 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 10 chunk allocations] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 11 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full) No further allocations are possible, with 8 GiB wasted (4 GiB of data space). The sort in gather_device_info now considers free space in holes longer than 1 GiB to be distinct, so it will prefer devices 2 and 3 over device 1 until all but 1 GiB is allocated on devices 2 and 3. At that point, with only 1 GiB unallocated on every device, the largest hole length on each device is equal at 1 GiB, so the sort finally moves to ordering the devices with the most free space, but by this time it is too late to make use of the free space on device 1. Note that it's possible to contrive a case where the pre-f6fca3917b4d allocator fails the same way, but these cases generally have extensive dev_extent fragmentation as a precondition (e.g. many holes of 768M in length on one device, and few holes 1 GiB in length on the others). With the regression in f6fca3917b4d, bad chunk allocation can occur even under optimal conditions, when all dev_extent holes are exact multiples of stripe_len in length, as in the example above. Also note that post-f6fca3917b4d kernels do treat dev_extent holes larger than 10 GiB as equal, so the bad behavior won't show up on a freshly formatted filesystem; however, as the filesystem ages and fills up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem can show up as a failure to balance after adding or removing devices, or an unexpected shortfall in available space due to unequal allocation. To fix the regression and make data chunk allocation work again, set ctl->max_stripe_len back to the original SZ_1G, or space_info->chunk_size if that's smaller (the latter can happen if the user set space_info->chunk_size to less than 1 GiB via sysfs, or it's a 32 MiB system chunk with a hardcoded chunk_size and stripe_len). While researching the background of the earlier commits, I found that an identical fix was already proposed at: https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/ The previous review missed one detail: ctl->max_stripe_len is used before decide_stripe_size_regular() is called, when it is too late for the changes in that function to have any effect. ctl->max_stripe_len is not used directly by decide_stripe_size_regular(), but the parameter does heavily influence the per-device free space data presented to the function. Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") CC: stable@vger.kernel.org # 6.1+ Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
20218dfb |
|
04-Sep-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make sure to initialize start and len in find_free_dev_extent Jens reported a compiler error when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this In function ‘gather_device_info’, inlined from ‘btrfs_create_chunk’ at fs/btrfs/volumes.c:5507:8: fs/btrfs/volumes.c:5245:48: warning: ‘dev_offset’ may be used uninitialized [-Wmaybe-uninitialized] 5245 | devices_info[ndevs].dev_offset = dev_offset; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~ fs/btrfs/volumes.c: In function ‘btrfs_create_chunk’: fs/btrfs/volumes.c:5196:13: note: ‘dev_offset’ was declared here 5196 | u64 dev_offset; This occurs because find_free_dev_extent is responsible for setting dev_offset, however if we get an -ENOMEM at the top of the function we'll return without setting the value. This isn't actually a problem because we will see the -ENOMEM in gather_device_info() and return and not use the uninitialized value, however we also just don't want the compiler warning so rework the code slightly in find_free_dev_extent() to make sure it's always setting *start and *len to avoid the compiler warning. Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
319baafc |
|
31-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify memcpy either of metadata_uuid or fsid There is a helper which provides either metadata_uuid or fsid as per METADATA_UUID flag. So use it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4844c366 |
|
31-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add a helper to read the superblock metadata_uuid In some cases, we need to read the FSID from the superblock when the metadata_uuid is not set, and otherwise, read the metadata_uuid. So, add a helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed8947bc |
|
26-Jul-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: merge find_free_dev_extent() and find_free_dev_extent_start() There is no point in having find_free_dev_extent() because it's just a simple wrapper around find_free_dev_extent_start() which always passes a value of 0 for the search_start argument. Since there are no other callers of find_free_dev_extent_start(), remove find_free_dev_extent() and rename find_free_dev_extent_start() to find_free_dev_extent(), removing its search_start argument because it's always 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
883647f4 |
|
26-Jul-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make find_free_dev_extent() static The function find_free_dev_extent() is only used within volumes.c, so make it static and remove its prototype from volumes.h. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f9879eb |
|
27-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: print name and pid when device scanning processes race There is a race between systemd and mount, as both of them try to register the device in the kernel. When systemd loses the race, it prints the following message: BTRFS error: device /dev/sdb7 belongs to fsid 1b3bacbf-14db-49c9-a3ef-547998aacc4e, and the fs is already mounted. The 'btrfs dev scan' registers one device at a time, so there is no way for the mount thread to wait in the kernel for all the devices to have registered as it won't know if all the devices are discovered. For now, improve the error log by printing the command name and process ID along with the error message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2cc4400 |
|
27-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: simplify the no-bioc fast path condition in btrfs_map_block nr_alloc_stripes can't be one if we are writing to a replacement device, as it is incremented for that case right above. Remove the duplicate checks. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5860f82 |
|
30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make find_first_extent_bit() return a boolean Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed3764f7 |
|
27-Jun-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: add comments for btrfs_map_block() The function btrfs_map_block() is a critical part of the btrfs storage layer, which handles mapping of logical ranges to physical ranges. Thus it's better to have some basic explanation, especially on the following points: - Segment split by various boundaries As a continuous logical range may be split into different segments, due to various factors like zones and RAID0/5/6/10 boundaries. - The meaning of @mirror_num - The possible single stripe optimization - One deprecated parameter @need_raid_map Just explicitly mark it deprecated so we're aware of the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
913e9928 |
|
07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: drop the timespec64 argument from update_time Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
29eefa6d |
|
15-Aug-2023 |
xiaoshoukui <xiaoshoukui@gmail.com> |
btrfs: fix BUG_ON condition in btrfs_cancel_balance Pausing and canceling balance can race to interrupt balance lead to BUG_ON panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance does not take this race scenario into account. However, the race condition has no other side effects. We can fix that. Reproducing it with panic trace like this: kernel BUG at fs/btrfs/volumes.c:4618! RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0 Call Trace: <TASK> ? do_nanosleep+0x60/0x120 ? hrtimer_nanosleep+0xb7/0x1a0 ? sched_core_clone_cookie+0x70/0x70 btrfs_ioctl_balance_ctl+0x55/0x70 btrfs_ioctl+0xa46/0xd20 __x64_sys_ioctl+0x7d/0xa0 do_syscall_64+0x38/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Race scenario as follows: > mutex_unlock(&fs_info->balance_mutex); > -------------------- > .......issue pause and cancel req in another thread > -------------------- > ret = __btrfs_balance(fs_info); > > mutex_lock(&fs_info->balance_mutex); > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) { > btrfs_info(fs_info, "balance: paused"); > btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED); > } CC: stable@vger.kernel.org # 4.19+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e7de35e |
|
27-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block The mirror_num_ret is allowed to be NULL, although it has to be set when smap is set. Unfortunately that is not a well enough specifiable invariant for static type checkers, so add a NULL check to make sure they are fine. Fixes: 03793cbbc80f ("btrfs: add fast path for single device io in __btrfs_map_block") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b19c98f2 |
|
22-Jun-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix race between balance and cancel/pause Syzbot reported a panic that looks like this: assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259 Call Trace: <TASK> btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline] btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline] btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The reproducer is running a balance and a cancel or pause in parallel. The way balance finishes is a bit wonky, if we were paused we need to save the balance_ctl in the fs_info, but clear it otherwise and cleanup. However we rely on the return values being specific errors, or having a cancel request or no pause request. If balance completes and returns 0, but we have a pause or cancel request we won't do the appropriate cleanup, and then the next time we try to start a balance we'll trip this ASSERT. The error handling is just wrong here, we always want to clean up, unless we got -ECANCELLED and we set the appropriate pause flag in the exclusive op. With this patch the reproducer ran for an hour without tripping, previously it would trip in less than a few minutes. Reported-by: syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb091225 |
|
22-Jun-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix remaining u32 overflows when left shifting stripe_nr There was regression caused by a97699d1d610 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr"). To avoid code churn the fix was open coding the type casts but unfortunately missed one which was still possible to hit [1]. The missing place was assignment of bioc->full_stripe_logical inside btrfs_map_block(). Fix it by adding a helper that does the safe calculation of the offset and use it everywhere even though it may not be strictly necessary due to already using u64 types. This replaces all remaining "<< BTRFS_STRIPE_LEN_SHIFT" calls. [1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/ Fixes: a7299a18a179 ("btrfs: fix u32 overflows when left shifting stripe_nr") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a7299a18 |
|
20-Jun-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix u32 overflows when left shifting stripe_nr [BUG] David reported an ASSERT() get triggered during fio load on 8 devices with data/raid6 and metadata/raid1c3: fio --rw=randrw --randrepeat=1 --size=3000m \ --bsrange=512b-64k --bs_unaligned \ --ioengine=libaio --fsync=1024 \ --name=job0 --name=job1 \ The ASSERT() is from rbio_add_bio() of raid56.c: ASSERT(orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN); Which is checking if the target rbio is crossing the full stripe boundary. [100.789] assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1622 [100.795] ------------[ cut here ]------------ [100.796] kernel BUG at fs/btrfs/raid56.c:1622! [100.797] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [100.798] CPU: 1 PID: 100 Comm: kworker/u8:4 Not tainted 6.4.0-rc6-default+ #124 [100.799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 [100.802] Workqueue: writeback wb_workfn (flush-btrfs-1) [100.803] RIP: 0010:rbio_add_bio+0x204/0x210 [btrfs] [100.806] RSP: 0018:ffff888104a8f300 EFLAGS: 00010246 [100.808] RAX: 00000000000000a1 RBX: ffff8881075907e0 RCX: ffffed1020951e01 [100.809] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000001 [100.811] RBP: 0000000141d20000 R08: 0000000000000001 R09: ffff888104a8f04f [100.813] R10: ffffed1020951e09 R11: 0000000000000003 R12: ffff88810e87f400 [100.815] R13: 0000000041d20000 R14: 0000000144529000 R15: ffff888101524000 [100.817] FS: 0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000 [100.821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [100.822] CR2: 000055d54e44c270 CR3: 000000010a9a1006 CR4: 00000000003706a0 [100.824] Call Trace: [100.825] <TASK> [100.825] ? die+0x32/0x80 [100.826] ? do_trap+0x12d/0x160 [100.827] ? rbio_add_bio+0x204/0x210 [btrfs] [100.827] ? rbio_add_bio+0x204/0x210 [btrfs] [100.829] ? do_error_trap+0x90/0x130 [100.830] ? rbio_add_bio+0x204/0x210 [btrfs] [100.831] ? handle_invalid_op+0x2c/0x30 [100.833] ? rbio_add_bio+0x204/0x210 [btrfs] [100.835] ? exc_invalid_op+0x29/0x40 [100.836] ? asm_exc_invalid_op+0x16/0x20 [100.837] ? rbio_add_bio+0x204/0x210 [btrfs] [100.837] raid56_parity_write+0x64/0x270 [btrfs] [100.838] btrfs_submit_chunk+0x26e/0x800 [btrfs] [100.840] ? btrfs_bio_init+0x80/0x80 [btrfs] [100.841] ? release_pages+0x503/0x6d0 [100.842] ? folio_unlock+0x2f/0x60 [100.844] ? __folio_put+0x60/0x60 [100.845] ? btrfs_do_readpage+0xae0/0xae0 [btrfs] [100.847] btrfs_submit_bio+0x21/0x60 [btrfs] [100.847] submit_one_bio+0x6a/0xb0 [btrfs] [100.849] extent_write_cache_pages+0x395/0x680 [btrfs] [100.850] ? __extent_writepage+0x520/0x520 [btrfs] [100.851] ? mark_usage+0x190/0x190 [100.852] extent_writepages+0xdb/0x130 [btrfs] [100.853] ? extent_write_locked_range+0x480/0x480 [btrfs] [100.854] ? mark_usage+0x190/0x190 [100.854] ? attach_extent_buffer_page+0x220/0x220 [btrfs] [100.855] ? reacquire_held_locks+0x178/0x280 [100.856] ? writeback_sb_inodes+0x245/0x7f0 [100.857] do_writepages+0x102/0x2e0 [100.858] ? page_writeback_cpu_online+0x10/0x10 [100.859] ? __lock_release.isra.0+0x14a/0x4d0 [100.860] ? reacquire_held_locks+0x280/0x280 [100.861] ? __lock_acquired+0x1e9/0x3d0 [100.862] ? do_raw_spin_lock+0x1b0/0x1b0 [100.863] __writeback_single_inode+0x94/0x450 [100.864] writeback_sb_inodes+0x372/0x7f0 [100.864] ? lock_sync+0xd0/0xd0 [100.865] ? do_raw_spin_unlock+0x93/0xf0 [100.866] ? sync_inode_metadata+0xc0/0xc0 [100.867] ? rwsem_optimistic_spin+0x340/0x340 [100.868] __writeback_inodes_wb+0x70/0x130 [100.869] wb_writeback+0x2d1/0x530 [100.869] ? __writeback_inodes_wb+0x130/0x130 [100.870] ? lockdep_hardirqs_on_prepare.part.0+0xf1/0x1c0 [100.870] wb_do_writeback+0x3eb/0x480 [100.871] ? wb_writeback+0x530/0x530 [100.871] ? mark_lock_irq+0xcd0/0xcd0 [100.872] wb_workfn+0xe0/0x3f0< [CAUSE] Commit a97699d1d610 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN") changes how we calculate the map length, to reduce u64 division. Function btrfs_max_io_len() is to get the length to the stripe boundary. It calculates the full stripe start offset (inside the chunk) by the following code: *full_stripe_start = rounddown(*stripe_nr, nr_data_stripes(map)) << BTRFS_STRIPE_LEN_SHIFT; The calculation itself is fine, but the value returned by rounddown() is dependent on both @stripe_nr (which is u32) and nr_data_stripes() (which returned int). Thus the result is also u32, then we do the left shift, which can overflow u32. If such overflow happens, @full_stripe_start will be a value way smaller than @offset, causing later "full_stripe_len - (offset - *full_stripe_start)" to underflow, thus make later length calculation to have no stripe boundary limit, resulting a write bio to exceed stripe boundary. There are some other locations like this, with a u32 @stripe_nr got left shift, which can lead to a similar overflow. [FIX] Fix all @stripe_nr with left shift with a type cast to u64 before the left shift. Those involved @stripe_nr or similar variables are recording the stripe number inside the chunk, which is small enough to be contained by u32, but their offset inside the chunk can not fit into u32. Thus for those specific left shifts, a type cast to u64 is necessary so this patch does not touch them and the code will be cleaned up in the future to keep the fix minimal. Reported-by: David Sterba <dsterba@suse.com> Fixes: a97699d1d610 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN") Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c9e561c4 |
|
15-Jun-2023 |
Jeff Layton <jlayton@kernel.org> |
btrfs: update i_version in update_dev_time When updating the ctime, we also want to update i_version. This is just something I noticed by inspection. There is probably no way to test this today unless you can somehow get to this inode via nfsd. Still, I think it's the right thing to do for consistency's sake. David Sterba's comment: I don't see anything wrong with setting the iversion bit, however I also don't see where this would be useful. Agreed with the consistency, otherwise the time is updated when device super block is wiped or a device initialized, both are big events so missing that due to lack of iversion update seems unlikely. I'll add it to the queue, thanks. Signed-off-by: Jeff Layton <jlayton@kernel.org> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8680e587 |
|
30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: open code need_full_stripe conditions need_full_stripe is just a somewhat complicated way to say "op != BTRFS_MAP_READ". Just spell that explicit check out, which makes a lot of the code currently using the helper easier to understand. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
723b8bb1 |
|
30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: open code btrfs_map_sblock btrfs_map_sblock just hard codes three arguments and calls btrfs_map_sblock. Remove it as it doesn't provide any real value, but makes following the btrfs_map_block call chains harder. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cd4efd21 |
|
30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: rename __btrfs_map_block to btrfs_map_block Now that the old btrfs_map_block is gone, drop the leading underscores from __btrfs_map_block. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d69d7ffc |
|
30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove unused btrfs_map_block There are no users of btrfs_map_block left, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3965a4c7 |
|
30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove unused BTRFS_MAP_DISCARD BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to btrfs_op() only only checked in two ASSERTS. Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06f188 ("btrfs: split discard handling out of btrfs_map_block"). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a3c54b0b |
|
24-May-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify how changed fsid and metadata_uuid is checked We often check if the metadata_uuid is not the same as fsid, and then we check if the given fsid matches the metadata_uuid. This patch refactors this logic into function match_fsid_changed and utilize it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a898345 |
|
24-May-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify fsid and metadata_uuid comparisons Refactor the functions find_fsid() and find_fsid_with_metadata_uuid(), as they currently share a common set of code to compare the fsid and metadata_uuid. Create a common helper function, match_fsid_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6930d7d |
|
24-May-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: merge calls to alloc_fs_devices in device_list_add Simplify has_metadata_uuid checks - by localizing the has_metadata_uuid checked within alloc_fs_devices()'s second argument, it improves the code readability. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19c4c49c |
|
24-May-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: streamline fsid checks in alloc_fs_devices We currently have redundant checks for the non-null value of fsid simplify it. And, no one is using alloc_fs_devices() with a NULL metadata_uuid while fsid is not NULL, add an assert() to verify this condition. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2db4d5c |
|
26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make btrfs_free_device() static The function btrfs_free_device() is never used outside of volumes.c, so make it static and remove its prototype declaration at volumes.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05bdb996 |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: replace fmode_t with a block-specific type for block open flags The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
2736e8ee |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: use the holder as indication for exclusive opens The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
2ef78928 |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't pass a holder for non-exclusive blkdev_get_by_path Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't make sense, so pass NULL instead and remove the holder argument from the call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls. Exclusive mode for device scanning is not used since commit 50d281fc434c ("btrfs: scan device in non-exclusive mode")". Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0718afd4 |
|
01-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
611ccc58 |
|
26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix leak of source device allocation state after device replace When a device replace finishes, the source device is freed by calling btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the allocation state, tracked in the device's alloc_state io tree, is never freed. This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state"), which removed a call to extent_io_tree_release() from btrfs_free_device(), with the rationale that btrfs_close_one_device() already releases the allocation state from a device and btrfs_close_one_device() is always called before a device is freed with btrfs_free_device(). However that is not true for the device replace case, as btrfs_free_device() is called without any previous call to btrfs_close_one_device(). The issue is trivial to reproduce, for example, by running test btrfs/027 from fstests: $ ./check btrfs/027 $ rmmod btrfs $ dmesg (...) [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished [84519.552251] BTRFS info (device sdc): scrub: started on devid 1 [84519.552277] BTRFS info (device sdc): scrub: started on devid 2 [84519.552332] BTRFS info (device sdc): scrub: started on devid 3 [84519.552705] BTRFS info (device sdc): scrub: started on devid 4 [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0 [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0 [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0 [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1 [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 So fix this by adding back the call to extent_io_tree_release() at btrfs_free_device(). Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ba7d5f5 |
|
23-Mar-2023 |
Genjian Zhang <zhanggenjian@kylinos.cn> |
btrfs: fix uninitialized variable warnings There are some warnings on older compilers (gcc 10, 7) or non-x86_64 architectures (aarch64). As btrfs wants to enable -Wmaybe-uninitialized by default, fix the warnings even though it's not necessary on recent compilers (gcc 12+). ../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’: ../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2703 | btrfs_setup_sprout(fs_info, seed_devices); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../fs/btrfs/send.c: In function ‘get_cur_inode_state’: ../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 70 | (__if_trace.miss_hit[1]++,1) : \ | ^ ../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here 1878 | u64 right_gen; | ^~~~~~~~~ Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4886ff7b |
|
19-Mar-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce a new helper to submit write bio for repair Both scrub and read-repair are utilizing a special repair writes that: - Only writes back to a single device Even for read-repair on RAID56, we only update the corrupted data stripe itself, not triggering the full RMW path. - Requires a valid @mirror_num For RAID56 case, only @mirror_num == 1 is valid. For non-RAID56 cases, we need @mirror_num to locate our stripe. - No data csum generation needed These two call sites still have some differences though: - Read-repair goes plain bio It doesn't need a full btrfs_bio, and goes submit_bio_wait(). - New scrub repair would go btrfs_bio To simplify both read and write path. So here this patch would: - Introduce a common helper, btrfs_map_repair_block() Due to the single device nature, we can use an on-stack btrfs_io_stripe to pass device and its physical bytenr. - Introduce a new interface, btrfs_submit_repair_bio(), for later scrub code This is for the incoming scrub code. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0bb5474 |
|
04-Apr-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove redundant release of btrfs_device::alloc_state Commit 321f69f86a0f ("btrfs: reset device back to allocation state when removing") included adding extent_io_tree_release(&device->alloc_state) to btrfs_close_one_device(), which had already been called in btrfs_free_device(). The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore, the additional call to extent_io_tree_release(&device->alloc_state) in btrfs_free_device() is unnecessary and can be removed. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1f16033c |
|
04-Apr-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: warn for any missed cleanup at btrfs_close_one_device During my recent search for the root cause of a reported bug, I realized that it's a good idea to issue a warning for missed cleanup instead of using debug-only assertions. Since most installations run with debug off, missed cleanups and premature calls to close could go unnoticed. However, these issues are serious enough to warrant reporting and fixing. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5758d1bd |
|
21-Mar-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove bytes_used argument from btrfs_make_block_group() The only caller of btrfs_make_block_group() always passes 0 as the value for the bytes_used argument, so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f50fa91 |
|
22-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not use replace target device as an extra mirror [BUG] Currently btrfs can use dev-replace device as an extra mirror for read-repair. But it can lead to NODATASUM corruption in the following case: There is a RAID1 data chunk, and dev-replace is running from dev2 to dev0. |//| = Replaced data X X+1MB X+2MB Dev 2: | | | <- Source dev Dev 0: |///////| | <- Target dev Then a read on dev 2 X+2MB happens. And something wrong happened inside devid 2, causing an -EIO. In that case, read-repair would try the next mirror, and since we can use target device as an extra mirror, we will use that mirror instead. But unfortunately since the read is beyond the current replace cursor, we should not trust it at all, what we get would be just uninitialized garbage. But if this read is for NODATASUM range, then we just trust them and cause data corruption. [CAUSE] We used to have some checks to make sure we only return such extra mirror when the range is before our left cursor. The first commit introducing this behavior is ad6d620e2a57 ("Btrfs: allow repair code to include target disk when searching mirrors"). But later a fix, 22ab04e814f4 ("Btrfs: fix race between device replace and chunk allocation") changed the behavior, to always let btrfs_map_block() include the extra mirror to address a race in dev-replace which can cause missing writes to target device. This means, we lose the tracking of cursor for the extra mirror, thus can lead to above corruption. [FIX] The extra mirror is never a reliable one, at the beginning of dev-replace, the reliability is zero, while only at the end of the replace it's a fully reliable mirror. We either do the complex tracking, or never trust it. IMHO it's much easier to maintain if we don't trust it at all, and the extra mirror can only benefit for a limited period of time (during replace). Thus this patch would completely remove the ability to use target device as an extra mirror. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
18d758a2 |
|
16-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: replace btrfs_io_context::raid_map with a fixed u64 value In btrfs_io_context structure, we have a pointer raid_map, which indicates the logical bytenr for each stripe. But considering we always call sort_parity_stripes(), the result raid_map[] is always sorted, thus raid_map[0] is always the logical bytenr of the full stripe. So why we waste the space and time (for sorting) for raid_map? This patch will replace btrfs_io_context::raid_map with a single u64 number, full_stripe_start, by: - Replace btrfs_io_context::raid_map with full_stripe_start - Replace call sites using raid_map[0] to use full_stripe_start - Replace call sites using raid_map[i] to compare with nr_data_stripes. The benefits are: - Less memory wasted on raid_map It's sizeof(u64) * num_stripes vs sizeof(u64). It'll always save at least one u64, and the benefit grows larger with num_stripes. - No more weird alloc_btrfs_io_context() behavior As there is only one fixed size + one variable length array. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1faf3885 |
|
06-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: use an efficient way to represent source of duplicated stripes For btrfs dev-replace, we have to duplicate writes to the source device into the target device. For non-RAID56, all writes into the same mapped ranges are sharing the same content, thus they don't really need to bother anything. (E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the same write to all involved devices). But for RAID56, all stripes contain different content, thus we must have a clear mapping of which stripe is duplicated from which original stripe. Currently we use a complex way using tgtdev_map[] array, e.g: num_tgtdevs = 1 tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace. tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace, and it's duplicated to stripes[3]. tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace. But this is wasting some space, and ignores one important thing for dev-replace, there is at most one running replace. Thus we can change it to a fixed array to represent the mapping: replace_nr_stripes = 1 replace_stripe_src = 1 <- Means stripes[1] is involved in replace. thus the extra stripe is a copy of stripes[1] By this we can save some space for bioc on RAID56 chunks with many devices. And we get rid of one variable sized array from bioc. Thus the patch involves the following changes: - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes and @replace_stripe_src. @num_tgtdevs is just renamed to @replace_nr_stripes. While the mapping is completely changed. - Add extra ASSERT()s for RAID56 code - Only add two more extra stripes for dev-replace cases. As we have an upper limit on how many dev-replace stripes we can have. - Unify the behavior of handle_ops_on_dev_replace() Previously handle_ops_on_dev_replace() go two different paths for WRITE and GET_READ_MIRRORS. Now unify them by always going the WRITE path first (with at most 2 replace stripes), then if we're doing GET_READ_MIRRORS and we have 2 extra stripes, just drop one stripe. - Remove the @real_stripes argument from alloc_btrfs_io_context() As we don't need the old variable length array any more. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ced85f8 |
|
06-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: reduce type width of btrfs_io_contexts That structure is our ultimate object for all __btrfs_map_block() related functions. We have some hard to understand members, like tgtdev_map, but without any comments. This patch will improve the situation: - Add extra comments for num_stripes, mirror_num, num_tgtdevs and tgtdev_map[] Especially for the last two members, add a dedicated (thus very long) comments for them, with example to explain it. - Shrink those int members to u16. In fact our on-disk format is only using u16 for num_stripes, thus no need to use int at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
be5c7edb |
|
06-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: simplify the bioc argument for handle_ops_on_dev_replace() There is no memory re-allocation for handle_ops_on_dev_replace(), thus we don't need to pass a btrfs_io_context pointer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ded22c1 |
|
16-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 There are quite some div64 calls inside btrfs_map_block() and its variants. Such calls are for @stripe_nr, where @stripe_nr is the number of stripes before our logical bytenr inside a chunk. However we can eliminate such div64 calls by just reducing the width of @stripe_nr from 64 to 32. This can be done because our chunk size limit is already 10G, with fixed stripe length 64K. Thus a U32 is definitely enough to contain the number of stripes. With such width reduction, we can get rid of slower div64, and extra warning for certain 32bit arch. This patch would do: - Add a new tree-checker chunk validation on chunk length Make sure no chunk can reach 256G, which can also act as a bitflip checker. - Reduce the width from u64 to u32 for @stripe_nr variables - Replace unnecessary div64 calls with regular modulo and division 32bit division and modulo are much faster than 64bit operations, and we are finally free of the div64 fear at least in those involved functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a97699d1 |
|
16-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN Currently btrfs doesn't support stripe lengths other than 64KiB. This is already set in the tree-checker. There is really no meaning to record that fixed value in map_lookup for now, and can all be replaced with BTRFS_STRIPE_LEN. Furthermore we can use the fix stripe length to do the following optimization: - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division Now we only need to do a right shift. And the value of BTRFS_STRIPE_LEN itself is already too large for bit shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift, a compiler warning would be triggered. Thus this bit shift optimization would be safe. - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2d82a40a |
|
22-Mar-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when aborting transaction during relocation with scrub Before relocating a block group we pause scrub, then do the relocation and then unpause scrub. The relocation process requires starting and committing a transaction, and if we have a failure in the critical section of the transaction commit path (transaction state >= TRANS_STATE_COMMIT_START), we will deadlock if there is a paused scrub. That results in stack traces like the following: [42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6 [42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction. [42.936] ------------[ cut here ]------------ [42.936] BTRFS: Transaction aborted (error -28) [42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs] [42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...) [42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs] [42.936] Code: ff ff 45 8b (...) [42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282 [42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000 [42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff [42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8 [42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00 [42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0 [42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000 [42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0 [42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [42.936] Call Trace: [42.936] <TASK> [42.936] ? start_transaction+0xcb/0x610 [btrfs] [42.936] prepare_to_relocate+0x111/0x1a0 [btrfs] [42.936] relocate_block_group+0x57/0x5d0 [btrfs] [42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs] [42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs] [42.936] ? __pfx_autoremove_wake_function+0x10/0x10 [42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs] [42.936] btrfs_balance+0x8ff/0x11d0 [btrfs] [42.936] ? __kmem_cache_alloc_node+0x14a/0x410 [42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs] [42.937] ? mod_objcg_state+0xd2/0x360 [42.937] ? refill_obj_stock+0xb0/0x160 [42.937] ? seq_release+0x25/0x30 [42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0 [42.937] ? percpu_counter_add_batch+0x2e/0xa0 [42.937] ? __x64_sys_ioctl+0x88/0xc0 [42.937] __x64_sys_ioctl+0x88/0xc0 [42.937] do_syscall_64+0x38/0x90 [42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc [42.937] RIP: 0033:0x7f381a6ffe9b [42.937] Code: 00 48 89 44 24 (...) [42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b [42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003 [42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000 [42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423 [42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148 [42.937] </TASK> [42.937] ---[ end trace 0000000000000000 ]--- [42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left [59.196] INFO: task btrfs:346772 blocked for more than 120 seconds. [59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002 [59.196] Call Trace: [59.196] <TASK> [59.196] __schedule+0x392/0xa70 [59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370 [59.196] schedule+0x5d/0xd0 [59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs] [59.197] ? __pfx_autoremove_wake_function+0x10/0x10 [59.197] scrub_pause_off+0x21/0x50 [btrfs] [59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs] [59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs] [59.198] ? __pfx_autoremove_wake_function+0x10/0x10 [59.198] scrub_stripe+0x20d/0x740 [btrfs] [59.198] scrub_chunk+0xc4/0x130 [btrfs] [59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs] [59.198] ? __pfx_autoremove_wake_function+0x10/0x10 [59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs] [59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs] [59.199] ? _copy_from_user+0x7b/0x80 [59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs] [59.199] ? refill_stock+0x33/0x50 [59.199] ? should_failslab+0xa/0x20 [59.199] ? kmem_cache_alloc_node+0x151/0x460 [59.199] ? alloc_io_context+0x1b/0x80 [59.199] ? preempt_count_add+0x70/0xa0 [59.199] ? __x64_sys_ioctl+0x88/0xc0 [59.199] __x64_sys_ioctl+0x88/0xc0 [59.199] do_syscall_64+0x38/0x90 [59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.199] RIP: 0033:0x7f82ffaffe9b [59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b [59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003 [59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000 [59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640 [59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000 [59.199] </TASK> [59.199] INFO: task btrfs:346773 blocked for more than 120 seconds. [59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002 [59.201] Call Trace: [59.201] <TASK> [59.201] __schedule+0x392/0xa70 [59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370 [59.201] schedule+0x5d/0xd0 [59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs] [59.201] ? __pfx_autoremove_wake_function+0x10/0x10 [59.201] scrub_pause_off+0x21/0x50 [btrfs] [59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs] [59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs] [59.202] ? __pfx_autoremove_wake_function+0x10/0x10 [59.202] scrub_stripe+0x20d/0x740 [btrfs] [59.202] scrub_chunk+0xc4/0x130 [btrfs] [59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs] [59.203] ? __pfx_autoremove_wake_function+0x10/0x10 [59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs] [59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs] [59.203] ? _copy_from_user+0x7b/0x80 [59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs] [59.204] ? should_failslab+0xa/0x20 [59.204] ? kmem_cache_alloc_node+0x151/0x460 [59.204] ? alloc_io_context+0x1b/0x80 [59.204] ? preempt_count_add+0x70/0xa0 [59.204] ? __x64_sys_ioctl+0x88/0xc0 [59.204] __x64_sys_ioctl+0x88/0xc0 [59.204] do_syscall_64+0x38/0x90 [59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.204] RIP: 0033:0x7f82ffaffe9b [59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b [59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003 [59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000 [59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640 [59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000 [59.204] </TASK> [59.204] INFO: task btrfs:346774 blocked for more than 120 seconds. [59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002 [59.206] Call Trace: [59.206] <TASK> [59.206] __schedule+0x392/0xa70 [59.206] schedule+0x5d/0xd0 [59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs] [59.206] ? __pfx_autoremove_wake_function+0x10/0x10 [59.206] scrub_pause_off+0x21/0x50 [btrfs] [59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs] [59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs] [59.207] ? __pfx_autoremove_wake_function+0x10/0x10 [59.207] scrub_stripe+0x20d/0x740 [btrfs] [59.208] scrub_chunk+0xc4/0x130 [btrfs] [59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs] [59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120 [59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs] [59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs] [59.209] ? _copy_from_user+0x7b/0x80 [59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs] [59.209] ? should_failslab+0xa/0x20 [59.209] ? kmem_cache_alloc_node+0x151/0x460 [59.209] ? alloc_io_context+0x1b/0x80 [59.209] ? preempt_count_add+0x70/0xa0 [59.209] ? __x64_sys_ioctl+0x88/0xc0 [59.209] __x64_sys_ioctl+0x88/0xc0 [59.209] do_syscall_64+0x38/0x90 [59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.209] RIP: 0033:0x7f82ffaffe9b [59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b [59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003 [59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000 [59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640 [59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000 [59.209] </TASK> [59.209] INFO: task btrfs:346775 blocked for more than 120 seconds. [59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002 [59.211] Call Trace: [59.211] <TASK> [59.211] __schedule+0x392/0xa70 [59.211] schedule+0x5d/0xd0 [59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs] [59.211] ? __pfx_autoremove_wake_function+0x10/0x10 [59.211] scrub_pause_off+0x21/0x50 [btrfs] [59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs] [59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs] [59.212] ? __pfx_autoremove_wake_function+0x10/0x10 [59.212] scrub_stripe+0x20d/0x740 [btrfs] [59.213] scrub_chunk+0xc4/0x130 [btrfs] [59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs] [59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120 [59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs] [59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs] [59.214] ? _copy_from_user+0x7b/0x80 [59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs] [59.214] ? should_failslab+0xa/0x20 [59.214] ? kmem_cache_alloc_node+0x151/0x460 [59.214] ? alloc_io_context+0x1b/0x80 [59.214] ? preempt_count_add+0x70/0xa0 [59.214] ? __x64_sys_ioctl+0x88/0xc0 [59.214] __x64_sys_ioctl+0x88/0xc0 [59.214] do_syscall_64+0x38/0x90 [59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.214] RIP: 0033:0x7f82ffaffe9b [59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b [59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003 [59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000 [59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640 [59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000 [59.214] </TASK> [59.214] INFO: task btrfs:346776 blocked for more than 120 seconds. [59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002 [59.217] Call Trace: [59.217] <TASK> [59.217] __schedule+0x392/0xa70 [59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370 [59.217] schedule+0x5d/0xd0 [59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs] [59.217] ? __pfx_autoremove_wake_function+0x10/0x10 [59.217] scrub_pause_off+0x21/0x50 [btrfs] [59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs] [59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs] [59.218] ? __pfx_autoremove_wake_function+0x10/0x10 [59.218] scrub_stripe+0x20d/0x740 [btrfs] [59.218] scrub_chunk+0xc4/0x130 [btrfs] [59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs] [59.219] ? __pfx_autoremove_wake_function+0x10/0x10 [59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs] [59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs] [59.219] ? _copy_from_user+0x7b/0x80 [59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs] [59.219] ? should_failslab+0xa/0x20 [59.219] ? kmem_cache_alloc_node+0x151/0x460 [59.219] ? alloc_io_context+0x1b/0x80 [59.219] ? preempt_count_add+0x70/0xa0 [59.219] ? __x64_sys_ioctl+0x88/0xc0 [59.219] __x64_sys_ioctl+0x88/0xc0 [59.219] do_syscall_64+0x38/0x90 [59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.219] RIP: 0033:0x7f82ffaffe9b [59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b [59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003 [59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000 [59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640 [59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000 [59.219] </TASK> [59.219] INFO: task btrfs:346822 blocked for more than 120 seconds. [59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1 [59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002 [59.222] Call Trace: [59.222] <TASK> [59.222] __schedule+0x392/0xa70 [59.222] schedule+0x5d/0xd0 [59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs] [59.222] ? __pfx_autoremove_wake_function+0x10/0x10 [59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs] [59.223] ? start_transaction+0xcb/0x610 [btrfs] [59.223] prepare_to_relocate+0x111/0x1a0 [btrfs] [59.223] relocate_block_group+0x57/0x5d0 [btrfs] [59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs] [59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs] [59.224] ? __pfx_autoremove_wake_function+0x10/0x10 [59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs] [59.224] btrfs_balance+0x8ff/0x11d0 [btrfs] [59.224] ? __kmem_cache_alloc_node+0x14a/0x410 [59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs] [59.225] ? mod_objcg_state+0xd2/0x360 [59.225] ? refill_obj_stock+0xb0/0x160 [59.225] ? seq_release+0x25/0x30 [59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0 [59.225] ? percpu_counter_add_batch+0x2e/0xa0 [59.225] ? __x64_sys_ioctl+0x88/0xc0 [59.225] __x64_sys_ioctl+0x88/0xc0 [59.225] do_syscall_64+0x38/0x90 [59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc [59.225] RIP: 0033:0x7f381a6ffe9b [59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b [59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003 [59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000 [59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423 [59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148 [59.225] </TASK> What happens is the following: 1) A scrub is running, so fs_info->scrubs_running is 1; 2) Task A starts block group relocation, and at btrfs_relocate_chunk() it pauses scrub by calling btrfs_scrub_pause(). That increments fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running); 3) The scrub task pauses at scrub_pause_off(), waiting for fs_info->scrub_pause_req to decrease to 0; 4) Task A then enters btrfs_relocate_block_group(), and down that call chain we start a transaction and then attempt to commit it; 5) When task A calls btrfs_commit_transaction(), it either will do the commit itself or wait for some other task that already started the commit of the transaction - it doesn't matter which case; 6) The transaction commit enters state TRANS_STATE_COMMIT_START; 7) An error happens during the transaction commit, like -ENOSPC when running delayed refs or delayed items for example; 8) This results in calling transaction.c:cleanup_transaction(), where we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req from 0 to 1, and blocking this task waiting for fs_info->scrubs_running to decrease to 0; 9) From this point on, both the transaction commit and the scrub task hang forever: 1) The transaction commit is waiting for fs_info->scrubs_running to be decreased to 0; 2) The scrub task is at scrub_pause_off() waiting for fs_info->scrub_pause_req to decrease to 0 - so it can not proceed to stop the scrub and decrement fs_info->scrubs_running from 0 to 1. Therefore resulting in a deadlock. Fix this by having cleanup_transaction(), called if a transaction commit fails, not call btrfs_scrub_cancel() if relocation is in progress, and having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if the relocation failed and a transaction abort happened. This was triggered with btrfs/061 from fstests. Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50d281fc |
|
23-Mar-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: scan device in non-exclusive mode This fixes mkfs/mount/check failures due to race with systemd-udevd scan. During the device scan initiated by systemd-udevd, other user space EXCL operations such as mkfs, mount, or check may get blocked and result in a "Device or resource busy" error. This is because the device scan process opens the device with the EXCL flag in the kernel. Two reports were received: - btrfs/179 test case, where the fsck command failed with the -EBUSY error - LTP pwritev03 test case, where mkfs.vfs failed with the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem on the device. In both cases, fsck and mkfs (respectively) were racing with a systemd-udevd device scan, and systemd-udevd won, resulting in the -EBUSY error for fsck and mkfs. Reproducing the problem has been difficult because there is a very small window during which these userspace threads can race to acquire the exclusive device open. Even on the system where the problem was observed, the problem occurrences were anywhere between 10 to 400 iterations and chances of reproducing decreases with debug printk()s. However, an exclusive device open is unnecessary for the scan process, as there are no write operations on the device during scan. Furthermore, during the mount process, the superblock is re-read in the below function call chain: btrfs_mount_root btrfs_open_devices open_fs_devices btrfs_open_one_device btrfs_get_bdev_and_sb So, to fix this issue, removes the FMODE_EXCL flag from the scan operation, and add a comment. The case where mkfs may still write to the device and a scan is running, the btrfs signature is not written at that time so scan will not recognize such device. Reported-by: Sherry Yang <sherry.yang@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c3ab6df |
|
01-Mar-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: handle missing chunk mapping more gracefully [BUG] During my scrub rework, I did a stupid thing like this: bio->bi_iter.bi_sector = stripe->logical; btrfs_submit_bio(fs_info, bio, stripe->mirror_num); Above bi_sector assignment is using logical address directly, which lacks ">> SECTOR_SHIFT". This results a read on a range which has no chunk mapping. This results the following crash: BTRFS critical (device dm-1): unable to find logical 11274289152 length 65536 assertion failed: !IS_ERR(em), in fs/btrfs/volumes.c:6387 Sure this is all my fault, but this shows a possible problem in real world, that some bit flip in file extents/tree block can point to unmapped ranges, and trigger above ASSERT(), or if CONFIG_BTRFS_ASSERT is not configured, cause invalid pointer access. [PROBLEMS] In the above call chain, we just don't handle the possible error from btrfs_get_chunk_map() inside __btrfs_map_block(). [FIX] The fix is straightforward, replace the ASSERT() with proper error handling (callers handle errors already). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f8a02dc6 |
|
20-Jan-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove struct btrfs_io_geometry Now that btrfs_get_io_geometry has a single caller, we can massage it into a form that is more suitable for that caller and remove the marshalling into and out of struct btrfs_io_geometry. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67da05b3 |
|
17-Jan-2023 |
Colin Ian King <colin.i.king@gmail.com> |
btrfs: fix spelling mistakes found using codespell There quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f58d783 |
|
20-Jan-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: free device in btrfs_close_devices for a single device filesystem We have this check to make sure we don't accidentally add older devices that may have disappeared and re-appeared with an older generation from being added to an fs_devices (such as a replace source device). This makes sense, we don't want stale disks in our file system. However for single disks this doesn't really make sense. I've seen this in testing, but I was provided a reproducer from a project that builds btrfs images on loopback devices. The loopback device gets cached with the new generation, and then if it is re-used to generate a new file system we'll fail to mount it because the new fs is "older" than what we have in cache. Fix this by freeing the cache when closing the device for a single device filesystem. This will ensure that the mount command passed device path is scanned successfully during the next mount. CC: stable@vger.kernel.org # 5.10+ Reported-by: Daan De Meyer <daandemeyer@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c538de0 |
|
18-Jan-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: limit device extents to the device size There was a recent regression in btrfs/177 that started happening with the size class patches ("btrfs: introduce size class to block group allocator"). This however isn't a regression introduced by those patches, but rather the bug was uncovered by a change in behavior in these patches. The patches triggered more chunk allocations in the ^free-space-tree case, which uncovered a race with device shrink. The problem is we will set the device total size to the new size, and use this to find a hole for a device extent. However during shrink we may have device extents allocated past this range, so we could potentially find a hole in a range past our new shrink size. We don't actually limit our found extent to the device size anywhere, we assume that we will not find a hole past our device size. This isn't true with shrink as we're relocating block groups and thus creating holes past the device size. Fix this by making sure we do not search past the new device size, and if we wander into any device extents that start after our device size simply break from the loop and use whatever hole we've already found. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
26ecf243 |
|
21-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: stop using write_one_page in btrfs_scratch_superblock write_one_page is an awkward interface that expects the page locked and ->writepage to be implemented. Replace that by zeroing the signature bytes and synchronize the block device page using the proper bdev helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e0078f7 |
|
21-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: factor out scratching of one regular super block btrfs_scratch_superblocks open codes scratching super block of a non-zoned super block. Split the code to read, zero and write the superblock for regular devices into a separate helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed02363f |
|
11-Dec-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: add extra error messages to cover non-ENOMEM errors from device_add_list() [BUG] When test case btrfs/219 (aka, mount a registered device but with a lower generation) failed, there is not any useful information for the end user to find out what's going wrong. The mount failure just looks like this: # mount -o loop /tmp/219.img2 /mnt/btrfs/ mount: /mnt/btrfs: mount(2) system call failed: File exists. dmesg(1) may have more information after failed mount system call. While the dmesg contains nothing but the loop device change: loop1: detected capacity change from 0 to 524288 [CAUSE] In device_list_add() we have a lot of extra checks to reject invalid cases. That function also contains the regular device scan result like the following prompt: BTRFS: device fsid 6222333e-f9f1-47e6-b306-55ddd4dcaef4 devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (3027) But unfortunately not all errors have their own error messages, thus if we hit something wrong in device_add_list(), there may be no error messages at all. [FIX] Add errors message for all non-ENOMEM errors. For ENOMEM, I'd say we're in a much worse situation, and there should be some OOM messages way before our call sites. CC: stable@vger.kernel.org # 6.0+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1742e1c9 |
|
23-Nov-2022 |
void0red <void0red@gmail.com> |
btrfs: fix extent map use-after-free when handling missing device in read_one_chunk Store the error code before freeing the extent_map. Though it's reference counted structure, in that function it's the first and last allocation so this would lead to a potential use-after-free. The error can happen eg. when chunk is stored on a missing device and the degraded mount option is missing. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721 Reported-by: eriri <1527030098@qq.com> Fixes: adfb69af7d8c ("btrfs: add_missing_dev() should return the actual error") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: void0red <void0red@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
103c1972 |
|
15-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: split the bio submission path into a separate file The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f0eac07 |
|
25-Oct-2022 |
Li zeming <zeming@nfschina.com> |
btrfs: allocate btrfs_io_context without GFP_NOFAIL The __GFP_NOFAIL flag could loop indefinitely when allocation memory in alloc_btrfs_io_context. The callers starting from __btrfs_map_block already handle errors so it's safe to drop the flag. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb3e217b |
|
12-Nov-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use btrfs_dev_name() helper to handle missing devices better [BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb21e302 |
|
07-Nov-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move device->name RCU allocation and assign to btrfs_alloc_device() There is a repeating code section in the parent function after calling btrfs_alloc_device(), as below: name = rcu_string_strdup(path, GFP_...); if (!name) { btrfs_free_device(device); return ERR_PTR(-ENOMEM); } rcu_assign_pointer(device->name, name); Except in add_missing_dev() for obvious reasons. This patch consolidates that repeating code into the btrfs_alloc_device() itself so that the parent function doesn't have to duplicate code. This consolidation also helps to review issues regarding RCU lock violation with device->name. Parent function device_list_add() and add_missing_dev() use GFP_NOFS for the allocation, whereas the rest of the parent functions use GFP_KERNEL, so bring the NOFS allocation context using memalloc_nofs_save() in the function device_list_add() and add_missing_dev() is already doing it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
35da5a7e |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: drop private_data parameter from extent_io_tree_init All callers except one pass NULL, so the parameter can be dropped and the inode::io_tree initialization can be open coded. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
428c8e03 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: simplify percent calculation helpers, rename div_factor The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f0add25 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move super_block specific helpers into super.h This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fc6822c |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move scrub prototypes into scrub.h Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67707479 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move relocation prototypes into relocation.h Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7572dec8 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move ioctl prototypes into ioctl.h Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7a03b52 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move uuid tree prototypes to uuid-tree.h Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43dd529a |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07e81dc9 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7f13d42 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
63a7cb13 |
|
26-Jul-2022 |
David Sterba <dsterba@suse.com> |
btrfs: auto enable discard=async when possible There's a request to automatically enable async discard for capable devices. We can do that, the async mode is designed to wait for larger freed extents and is not intrusive, with limits to iops, kbps or latency. The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . The automatic selection is done if there's at least one discard capable device in the filesystem (not capable devices are skipped). Mounting with any other discard option will honor that option, notably mounting with nodiscard will keep it disabled. Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a8d1b164 |
|
04-Nov-2022 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: initialize device's zone info for seeding When performing seeding on a zoned filesystem it is necessary to initialize each zoned device's btrfs_zoned_device_info structure, otherwise mounting the filesystem will cause a NULL pointer dereference. This was uncovered by fstests' testcase btrfs/163. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
21e61ec6 |
|
04-Nov-2022 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: clone zoned device info when cloning a device When cloning a btrfs_device, we're not cloning the associated btrfs_zoned_device_info structure of the device in case of a zoned filesystem. Later on this leads to a NULL pointer dereference when accessing the device's zone_info for instance when setting a zone as active. This was uncovered by fstests' testcase btrfs/161. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0fca385d |
|
03-Nov-2022 |
Liu Shixin <liushixin2@huawei.com> |
btrfs: fix match incorrectly in dev_args_match_device syzkaller found a failed assertion: assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921 This can be triggered when we set devid to (u64)-1 by ioctl. In this case, the match of devid will be skipped and the match of device may succeed incorrectly. Patch 562d7b1512f7 introduced this function which is used to match device. This function contains two matching scenarios, we can distinguish them by checking the value of args->missing rather than check whether args->devid and args->uuid is default value. Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com Fixes: 562d7b1512f7 ("btrfs: handle device lookup with btrfs_dev_lookup_args") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
76a66ba1 |
|
20-Oct-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't use btrfs_chunk::sub_stripes from disk [BUG] There are two reports (the earliest one from LKP, a more recent one from kernel bugzilla) that we can have some chunks with 0 as sub_stripes. This will cause divide-by-zero errors at btrfs_rmap_block, which is introduced by a recent kernel patch ac0677348f3c ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block"): if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) { stripe_nr = stripe_nr * map->num_stripes + i; stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<< } [CAUSE] From the more recent report, it has been proven that we have some chunks with 0 as sub_stripes, mostly caused by older mkfs. It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa ("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk") which is included in v5.4 btrfs-progs release. So there would be quite some old filesystems with such 0 sub_stripes. [FIX] Just don't trust the sub_stripes values from disk. We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes numbers for each profile and that are fixed. By this, we can keep the compatibility with older filesystems while still avoid divide-by-zero bugs. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Viktor Kuzmin <kvaster@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559 Fixes: ac0677348f3c ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block") CC: stable@vger.kernel.org # 6.0 Reviewed-by: Su Yue <glass@fydeos.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a05d3c91 |
|
24-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: check superblock to ensure the fs was not modified at thaw time [BACKGROUND] There is an incident report that, one user hibernated the system, with one btrfs on removable device still mounted. Then by some incident, the btrfs got mounted and modified by another system/OS, then back to the hibernated system. After resuming from the hibernation, new write happened into the victim btrfs. Now the fs is completely broken, since the underlying btrfs is no longer the same one before the hibernation, and the user lost their data due to various transid mismatch. [REPRODUCER] We can emulate the situation using the following small script: truncate -s 1G $dev mkfs.btrfs -f $dev mount $dev $mnt fsstress -w -d $mnt -n 500 sync xfs_freeze -f $mnt cp $dev $dev.backup # There is no way to mount the same cloned fs on the same system, # as the conflicting fsid will be rejected by btrfs. # Thus here we have to wipe the fs using a different btrfs. mkfs.btrfs -f $dev.backup dd if=$dev.backup of=$dev bs=1M xfs_freeze -u $mnt fsstress -w -d $mnt -n 20 umount $mnt btrfs check $dev The final fsck will fail due to some tree blocks has incorrect fsid. This is enough to emulate the problem hit by the unfortunate user. [ENHANCEMENT] Although such case should not be that common, it can still happen from time to time. From the view of btrfs, we can detect any unexpected super block change, and if there is any unexpected change, we just mark the fs read-only, and thaw the fs. By this we can limit the damage to minimal, and I hope no one would lose their data by this anymore. Suggested-by: Goffredo Baroncelli <kreijack@libero.it> Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
928ff3be |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: stop allocation a btrfs_io_context for simple I/O The I/O context structure is only used to pass the btrfs_device to the end I/O handler for I/Os that go to a single device. Stop allocating the I/O context for these cases by passing the optional btrfs_io_stripe argument to __btrfs_map_block to query the mapping information and then using a fast path submission and I/O completion handler. As the old btrfs_io_context based I/O submission path is only used for mirrored writes, rename the functions to make that clear and stop setting the btrfs_bio device and mirror_num field that is only used for reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
03793cbb |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: add fast path for single device io in __btrfs_map_block There is no need for most of the btrfs_io_context when doing I/O to a single device. To support such I/O without the extra btrfs_io_context allocation, turn the mirror_num argument into a pointer so that it can be used to output the selected mirror number, and add an optional argument that points to a btrfs_io_stripe structure, which will be filled with a single extent if provided by the caller. In that case the btrfs_io_context allocation can be skipped as all information for the single device I/O is provided in the mirror_num argument and the on-stack btrfs_io_stripe. A caller that makes use of this new argument will be added in the next commit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28793b19 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: decide bio cloning inside submit_stripe_bio Remove the orig_bio argument as it can be derived from the bioc, and the clone argument as it can be calculated from bioc and dev_nr. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32747c44 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: factor out low-level bio setup from submit_stripe_bio Split out a low-level btrfs_submit_dev_bio helper that just submits the bio without any cloning decisions or setting up the end I/O handler for future reuse by a different caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
917f32a2 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: give struct btrfs_bio a real end_io handler Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io handler and bi_private pointer of the embedded struct bio are both used to handle the completion of the high-level btrfs_bio and for the I/O completion for the low-level device that the embedded bio ends up being sent to. To support this bi_end_io and bi_private are saved into the btrfs_io_context structure and then restored after the bio sent to the underlying device has completed the actual I/O. Untangle this by adding an end I/O handler and private data to struct btrfs_bio for the high-level btrfs_bio based completions, and leave the actual bio bi_end_io handler and bi_private pointer entirely to the low-level device I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f1c29379 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: properly abstract the parity raid bio handling The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c: - the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c - the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code To fix this up supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and updates the scrub code to also take ownership of the bioc and bio_counter in all cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3a62baf |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: use chained bios when cloning The stripes_pending in the btrfs_io_context counts number of inflight low-level bios for an upper btrfs_bio. For reads this is generally one as reads are never cloned, while for writes we can trivially use the bio remaining mechanisms that is used for chained bios. To be able to make use of that mechanism, split out a separate trivial end_io handler for the cloned bios that does a minimal amount of error tracking and which then calls bio_endio on the original bio to transfer control to that, with the remaining counter making sure it is completed last. This then allows to merge btrfs_end_bioc into the original bio bi_end_io handler. To make this all work all error handling needs to happen through the bi_end_io handler, which requires a small amount of reshuffling in submit_stripe_bio so that the bio is cloned already by the time the suitability of the device is checked. This reduces the size of the btrfs_io_context and prepares splitting the btrfs_bio at the stripe boundary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bbc72f1 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't take a bio_counter reference for cloned bios Stop grabbing an extra bio_counter reference for each clone bio in a mirrored write and instead just release the one original reference in btrfs_end_bioc once all the bios for a single btrfs_bio have completed instead of at the end of btrfs_submit_bio once all bios have been submitted. This means the reference is now carried by the "upper" btrfs_bio only instead of each lower bio. Also remove the now unused btrfs_bio_counter_inc_noblocked helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b42f5e3 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: pass the operation to btrfs_bio_alloc Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset set does. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d45cfb88 |
|
06-Aug-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move btrfs_bio allocation to volumes.c volumes.c is the place that implements the storage layer using the btrfs_bio structure, so move the bio_set and allocation helpers there as well. To make up for the new initialization boilerplate, merge the two init/exit helpers in extent_io.c into a single one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
588a4868 |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR Before when this was modifying the bit field we had to protect it with the bg->lock, however now we're using bit helpers so we can stop using the bg->lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9283b9e0 |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3349b57f |
|
15-Jul-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: convert block group bit field to use bit helpers We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5da431b7 |
|
18-Aug-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix the max chunk size and stripe length calculation [BEHAVIOR CHANGE] Since commit f6fca3917b4d ("btrfs: store chunk size in space-info struct"), btrfs no longer can create larger data chunks than 1G: mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4 mount $dev1 $mnt btrfs balance start --full $mnt btrfs balance start --full $mnt umount $mnt btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2 Before that offending commit, what we got is a 4G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Now what we got is only 1G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176 length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 This will increase the number of data chunks by the number of devices, not only increase system chunk usage, but also greatly increase mount time. Without a proper reason, we should not change the max chunk size. [CAUSE] Previously, we set max data chunk size to 10G, while max data stripe length to 1G. Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct") completely ignored the 10G limit, but use 1G max stripe limit instead, causing above shrink in max data chunk size. [FIX] Fix the max data chunk size to 10G, and in decide_stripe_size_regular() we limit stripe_size to 1G manually. This should only affect data chunks, as for metadata chunks we always set the max stripe size the same as max chunk size (256M or 1G depending on fs size). Now the same script result the same old result: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Reported-by: Wang Yugui <wangyugui@e16-tech.com> Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ea0106a |
|
15-Aug-2022 |
Zixuan Fu <r33s3n6@gmail.com> |
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if the path is invalid. In this case, btrfs_get_dev_args_from_path() returns directly without freeing args->uuid and args->fsid allocated before, which causes memory leak. To fix these possible leaks, when btrfs_get_bdev_and_sb() fails, btrfs_put_dev_args_from_path() is called to clean up the memory. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Fixes: faa775c41d655 ("btrfs: add a btrfs_get_dev_args_from_path helper") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d28beb3e |
|
18-Jul-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: merge btrfs_dev_stat_print_on_error with its only caller Fold it into the only caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9af128d |
|
16-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: raid56: transfer the bio counter reference to the raid submission helpers Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recovery together with the bio that the reference was acquired for instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6065fd95 |
|
16-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: do not return errors from raid56_parity_recover Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
31683f4a |
|
16-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: do not return errors from raid56_parity_write Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a722d8f |
|
16-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: do not return errors from btrfs_map_bio Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
462b0b2a |
|
16-Jun-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() For profiles other than RAID56, __btrfs_map_block() returns @map_length as min(stripe_end, logical + *length), which is also the same result from btrfs_get_io_geometry(). But for RAID56, __btrfs_map_block() returns @map_length as stripe_len. This strange behavior is going to hurt incoming bio split at btrfs_map_bio() time, as we will use @map_length as bio split size. Fix this behavior by returning @map_length by the same calculation as for other profiles. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff18a4af |
|
16-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: raid56: use fixed stripe length everywhere The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1867eb3 |
|
21-Jun-2022 |
David Sterba <dsterba@suse.com> |
btrfs: clean up chained assignments The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3613249a |
|
13-Jun-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: warn about dev extents that are inside the reserved range Btrfs on-disk format has reserved the first 1MiB for the primary super block (at 64KiB offset) and bootloaders may also use this space. This behavior is only introduced since v4.1 btrfs-progs release, although kernel can ensure we never touch the reserved range of super blocks, it's better to inform the end users, and a balance will resolve the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
37f85ec3 |
|
13-Jun-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use named constant for reserved device space There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d322b48 |
|
13-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies directly, only for RAID5 and RAID6 we need some extra handling as there's no table value for that. For RAID10 there's a change from sub_stripes to ncopies. The values are the same but semantically we want to use number of copies, as this is what btrfs_num_copies does. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b30f719 |
|
13-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use btrfs_raid_array to calculate number of parity stripes Use the raid table instead of hard coded values and rename the helper as it is exported. This could make later extension on RAID56 based profiles easier. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6dead96c |
|
13-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation In __btrfs_map_block() we have an assignment to @max_errors using nr_parity_stripes(). Although it works for RAID56 it's confusing. Replace it with btrfs_chunk_max_errors(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc88b486 |
|
13-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove parameter dev_extent_len from scrub_stripe() For scrub_stripe() we can easily calculate the dev extent length as we have the full info of the chunk. Thus there is no need to pass @dev_extent_len from the caller, and we introduce a helper, btrfs_calc_stripe_length(), to do the calculation from extent_map structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4012f06 |
|
03-Jun-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: split discard handling out of btrfs_map_block Mapping block for discard doesn't really share any code with the regular block mapping case. Split it out into an entirely separate helper that just returns an array of btrfs_discard_stripe structures and the number of stripes. This removes the need for the length field in the btrfs_io_context structure, so remove tht. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ff7ddd3 |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: do not allocate a btrfs_bio for low-level bios The bios submitted from btrfs_map_bio don't really interact with the rest of btrfs and the only btrfs_bio member actually used in the low-level bios is the pointer to the btrfs_io_context used for endio handler. Use a union in struct btrfs_io_stripe that allows the endio handler to find the btrfs_io_context and remove the spurious ->device assignment so that a plain fs_bio_set bio can be used for the low-level bios allocated inside btrfs_map_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a316a259 |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: factor stripe submission logic out of btrfs_map_bio Move all per-stripe handling into submit_stripe_bio and use a label to cleanup instead of duplicating the logic. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7b9416f |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove btrfs_end_io_wq All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4c46bde |
|
26-May-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move more work into btrfs_end_bioc Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of duplicating the logic in the callers. Also remove the bio argument as it always must be bioc->orig_bio and the now pointless bioc_error that did nothing but assign bi_sector to the same value just sampled in the caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f6fca391 |
|
08-Feb-2022 |
Stefan Roesch <shr@fb.com> |
btrfs: store chunk size in space-info struct The chunk size is stored in the btrfs_space_info structure. It is initialized at the start and is then used. A new API is added to update the current chunk size. This API is used to be able to expose the chunk_size as a sysfs setting. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename and merge helpers, switch atomic type to u64, style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b036f479 |
|
22-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: quit early if the fs has no RAID56 support for raid56 related checks The following functions do special handling for RAID56 chunks: - btrfs_is_parity_mirror() Check if the range is in RAID56 chunks. - btrfs_full_stripe_len() Either return sectorsize for non-RAID56 profiles or full stripe length for RAID56 chunks. But if a filesystem without any RAID56 chunks, it will not have RAID56 incompat flags, and we can skip the chunk tree looking up completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
719fae89 |
|
20-Apr-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use ilog2() to replace if () branches for btrfs_bg_flags_to_raid_index() In function btrfs_bg_flags_to_raid_index(), we use quite some if () to convert the BTRFS_BLOCK_GROUP_* bits to a index number. But the truth is, there is really no such need for so many branches at all. Since all BTRFS_BLOCK_GROUP_* flags are just one single bit set inside BTRFS_BLOCK_GROUP_PROFILES_MASK, we can easily use ilog2() to calculate their values. This calculation has an anchor point, the lowest PROFILE bit, which is RAID0. Even it's fixed on-disk format and should never change, here I added extra compile time checks to make it super safe: 1. Make sure RAID0 is always the lowest bit in PROFILE_MASK This is done by finding the first (least significant) bit set of RAID0 and PROFILE_MASK & ~RAID0. 2. Make sure RAID0 bit set beyond the highest bit of TYPE_MASK Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a7b8e39c |
|
01-Apr-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: raid56: enable subpage support for RAID56 Now the btrfs RAID56 infrastructure has migrated to use sector_ptr interface, it should be safe to enable subpage support for RAID56. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc353a8b |
|
12-Apr-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: reduce width for stripe_len from u64 to u32 Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough for the usage. Furthermore, even in the future we choose to enlarge stripe length to larger values, I don't believe we would want stripe as large as 4G or larger. So this patch will reduce the width for all in-memory structures and parameters, this involves: - RAID56 related function argument lists This allows us to do direct division related to stripe_len. Although we will use bits shift to replace the division anyway. - btrfs_io_geometry structure This involves one change to simplify the calculation of both @stripe_nr and @stripe_offset, using div64_u64_rem(). And add extra sanity check to make sure @stripe_offset is always small enough for u32. This saves 8 bytes for the structure. - map_lookup structure This convert @stripe_len to u32, which saves 8 bytes. (saved 4 bytes, and removed a 4-bytes hole) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d201238c |
|
28-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: repair super block num_devices automatically [BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() |- btrfs_rm_dev_item(device) | |- trans = btrfs_start_transaction() | | Now we got transaction X | | | |- btrfs_del_item() | | Now device item is removed from chunk tree | | | |- btrfs_commit_transaction() | Transaction X got committed, super num devs untouched, | but device item removed from chunk tree. | (AKA, super num devs is already incorrect) | |- cur_devices->num_devices--; |- cur_devices->total_devices--; |- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. This has been fixed by commit bbac58698a55 ("btrfs: remove device item and update super block in the same transaction"). [FIX] Make the super_num_devices check less strict, converting it from a hard error to a warning, and reset the value to a correct one for the current or next transaction commit. As the number of device items is the critical information where the super block num_devices is only a cached value (and also useful for cross checking), it's safe to automatically update it. Other device related problems like missing device are handled after that and may require other means to resolve, like degraded mount. With this fix, potentially affected filesystems won't fail mount and require the manual repair by btrfs check. Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
110ac0e5 |
|
03-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: pass a block_device to btrfs_bio_clone Pass the block_device to bio_alloc_clone instead of setting it later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fce3f24a |
|
03-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: move the call to bio_set_dev out of submit_stripe_bio Prepare for additional refactoring, btrfs_map_bio is direct caller of submit_stripe_bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
58ff51f1 |
|
03-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: check-integrity: split submit_bio from btrfsic checking Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d031dc4 |
|
31-Mar-2022 |
Yu Zhe <yuzhe@nfschina.com> |
btrfs: remove unnecessary type casts Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e959d3c1 |
|
12-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use dummy extent buffer for super block sys chunk array read In function btrfs_read_sys_array(), we allocate a real extent buffer using btrfs_find_create_tree_block(). Such extent buffer will be even cached into buffer_radix tree, and using btree inode address space. However we only use such extent buffer to enable the accessors, thus we don't even need to bother using real extent buffer, a dummy one is what we really need. And for dummy extent buffer, we no longer need to do any special handling for the first page, as subpage helper is already doing it properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43cb1478 |
|
09-Mar-2022 |
Gabriel Niebler <gniebler@suse.com> |
btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
10f0d2a5 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: add a bdev_nonrot helper Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
f9e69aa9 |
|
06-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
btrfs: simplify ->flush_bio handling Use and embedded bios that is initialized when used instead of bio_kmalloc plus bio_reset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220406061228.410163-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
a690e5f2 |
|
29-Mar-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: mark resumed async balance as writing When btrfs balance is interrupted with umount, the background balance resumes on the next mount. There is a potential deadlock with FS freezing here like as described in commit 26559780b953 ("btrfs: zoned: mark relocation as writing"). Mark the process as sb_writing to avoid it. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbac5869 |
|
07-Mar-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove device item and update super block in the same transaction [BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() |- btrfs_rm_dev_item(device) | |- trans = btrfs_start_transaction() | | Now we got transaction X | | | |- btrfs_del_item() | | Now device item is removed from chunk tree | | | |- btrfs_commit_transaction() | Transaction X got committed, super num devs untouched, | but device item removed from chunk tree. | (AKA, super num devs is already incorrect) | |- cur_devices->num_devices--; |- cur_devices->total_devices--; |- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. [FIX] Instead of starting and committing a transaction inside btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and pass it to btrfs_rm_dev_item(). And only commit the transaction after everything is done. Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79c9234b |
|
03-Mar-2022 |
Dongliang Mu <mudongliangabcd@gmail.com> |
btrfs: don't access possibly stale fs_info data in device_list_add Syzbot reported a possible use-after-free in printing information in device_list_add. Very similar with the bug fixed by commit 0697d9a61099 ("btrfs: don't access possibly stale fs_info data for printing duplicate device"), but this time the use occurs in btrfs_info_in_rcu. Call Trace: kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 btrfs_printk+0x395/0x425 fs/btrfs/super.c:244 device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957 btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387 btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by modifying device->fs_info to NULL too. Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ca5e4ea0 |
|
17-Feb-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: mark relocation as writing There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
914a519b |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: disable device manipulation ioctl's EXTENT_TREE_V2 Device add, remove, and replace all require balance, which doesn't work right now on extent tree v2, so disable these for now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4b349253 |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: disable balance for extent tree v2 for now With global root id's it makes it problematic to do backref lookups for balance. This isn't hard to deal with, but future changes are going to make it impossible to lookup backrefs on any COWonly roots, so go ahead and disable balance for now on extent tree v2 until we can add balance support back in future patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
823f8e5c |
|
12-Jan-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup temporary variables when finding rotational device status The pointer to struct request_queue is used only to get device type rotating or the non-rotating. So use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
330a5bf4 |
|
11-Jan-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use dev_t to match device in device_matched Commit "btrfs: add device major-minor info in the struct btrfs_device" saved the device major-minor number in the struct btrfs_device upon discovering it. So no need to lookup_bdev() again just match, which means device_matched() can go away. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4889bc05 |
|
11-Jan-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add device major-minor info in the struct btrfs_device Internally it is common to use the major-minor number to identify a device and, at a few locations in btrfs, we use the major-minor number to match the device. So when we identify a new btrfs device through device add or device replace or device-scan/ready save the device's major-minor (dev_t) in the struct btrfs_device so that we don't have to call lookup_bdev() again. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16cab91a |
|
11-Jan-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: match stale devices by dev_t After the commit "btrfs: harden identification of the stale device", we don't have to match the device path anymore. Instead, we match the dev_t. So pass in the dev_t instead of the device path, in the call chain btrfs_forget_devices()->btrfs_free_stale_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
770c79fb |
|
11-Jan-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: harden identification of a stale device Identifying and removing the stale device from the fs_uuids list is done by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn depends on device_path_matched() to check if the device appears in more than one btrfs_device structure. The matching of the device happens by its path, the device path. However, when device mapper is in use, the dm device paths are nothing but a link to the actual block device, which leads to the device_path_matched() failing to match. Fix this by matching the dev_t as provided by lookup_bdev() instead of plain string compare of the device paths. Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff37c89f |
|
11-Jan-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: move missing device handling in a dedicate function This simplifies the code flow in read_one_chunk and makes error handling when handling missing devices a bit simpler by reducing it to a single check if something went wrong. No functional changes. Reviewed-by: Su Yue <l@damenly.su> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f26c9238 |
|
14-Dec-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove reada infrastructure Currently there is only one user for btrfs metadata readahead, and that's scrub. But even for the single user, it's not providing the correct functionality it needs, as scrub needs reada for commit root, which current readahead can't provide. (Although it's pretty easy to add such feature). Despite this, there are some extra problems related to metadata readahead: - Duplicated feature with btrfs_path::reada - Partly duplicated feature of btrfs_fs_info::buffer_radix Btrfs already caches its metadata in buffer_radix, while readahead tries to read the tree block no matter if it's already cached. - Poor layer separation Metadata readahead works kinda at device level. This is definitely not the correct layer it should be, since metadata is at btrfs logical address space, it should not bother device at all. This brings extra chance for bugs to sneak in, while brings unnecessary complexity. - Dead code In the very beginning of scrub.c we have #undef DEBUG, rendering all the debug related code useless and unable to test. Thus here I purpose to remove the metadata readahead mechanism completely. [BENCHMARK] There is a full benchmark for the scrub performance difference using the old btrfs_reada_add() and btrfs_path::reada. For the worst case (no dirty metadata, slow HDD), there could be a 5% performance drop for scrub. For other cases (even SATA SSD), there is no distinguishable performance difference. The number is reported scrub speed, in MiB/s. The resolution is limited by the reported duration, which only has a resolution of 1 second. Old New Diff SSD 455.3 466.332 +2.42% HDD 103.927 98.012 -5.69% Comprehensive test methodology is in the cover letter of the patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
554aed7d |
|
07-Dec-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: sink zone check into btrfs_repair_one_zone Sink zone check into btrfs_repair_one_zone() so we don't need to do it in all callers. Also as btrfs_repair_one_zone() doesn't return a sensible error, make it a boolean function and return false in case it got called on a non-zoned filesystem and true on a zoned filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efc0e69c |
|
25-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: introduce exclusive operation BALANCE_PAUSED state Current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that have paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance it will now be into BALANCE_PAUSED instead of simply BALANCE state. 2. When a running balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd51eb2f |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't use the extent root in btrfs_chunk_alloc_add_chunk_item We're just using the extent_root to set the chunk owner to root_key->objectid, which is BTRFS_EXTENT_TREE_OBJECTID, so use that directly instead of using the root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf08387f |
|
18-Nov-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't check stripe length if the profile is not stripe based [BUG] When debugging calc_bio_boundaries(), I found that even for RAID1 metadata, we're following stripe length to calculate stripe boundary. # mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12] # mount /dev/test/scratch /mnt/btrfs # xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file # umount Above very basic operations will make calc_bio_boundaries() to report the following result: submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152 submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536 ... submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384 submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384 submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536 submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536 submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152 submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768 Where "r/i" is the rootid and inode, 1/1 means they metadata. The remaining names match the member used in kernel. Even all data/metadata are using RAID1, we're still following stripe length. [CAUSE] This behavior is caused by a wrong condition in btrfs_get_io_geometry(): if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { /* Fill using stripe_len */ len = min_t(u64, em->len - offset, max_len); } else { len = em->len - offset; } This means, only for SINGLE we will not follow stripe_len. However for profiles like RAID1*, DUP, they don't need to bother stripe_len. This can lead to unnecessary bio split for RAID1*/DUP profiles, and can even be a blockage for future zoned RAID support. [FIX] Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and change the condition to only calculate the length using stripe length for stripe based profiles. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16beac87 |
|
10-Nov-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: cache reported zone during mount When mounting a device, we are reporting the zones twice: once for checking the zone attributes in btrfs_get_dev_zone_info and once for loading block groups' zone info in btrfs_load_block_group_zone_info(). With a lot of block groups, that leads to a lot of REPORT ZONE commands and slows down the mount process. This patch introduces a zone info cache in struct btrfs_zoned_device_info. The cache is populated while in btrfs_get_dev_zone_info() and used for btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE commands. The zone cache is then released after loading the block groups, as it will not be much effective during the run time. Benchmark: Mount an HDD with 57,007 block groups Before patch: 171.368 seconds After patch: 64.064 seconds While it still takes a minute due to the slowness of loading all the block groups, the patch reduces the mount time by 1/3. Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/ Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
849eae5e |
|
09-Nov-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: consolidate device_list_mutex in prepare_sprout to its parent btrfs_prepare_sprout() splices seed devices into its own struct fs_devices, so that its parent function btrfs_init_new_device() can add the new sprout device to fs_info->fs_devices. Both btrfs_prepare_sprout() and btrfs_init_new_device() need device_list_mutex. But they are holding it separately, thus create a small race window. Close it and hold device_list_mutex across both functions btrfs_init_new_device() and btrfs_prepare_sprout(). Split btrfs_prepare_sprout() into btrfs_init_sprout() and btrfs_setup_sprout(). This split is essential because device_list_mutex must not be held for allocations in btrfs_init_sprout() but must be held for btrfs_setup_sprout(). So now a common device_list_mutex can be used between btrfs_init_new_device() and btrfs_setup_sprout(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd880809 |
|
20-Sep-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: switch seeding_dev in init_new_device to bool Declare int seeding_dev as a bool. Also, move its declaration a line below to adjust packing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3212fa14 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the _nr from the item helpers Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4989d4a0 |
|
15-Dec-2021 |
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> |
btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() The function btrfs_scan_one_device() calls blkdev_get_by_path() and blkdev_put() to get and release its target block device. However, when btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the block device is left without clean up. This triggered failure of fstests generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to call blkdev_put(). Fixes: 12659251ca5df ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4d9380e0 |
|
03-Nov-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: silence lockdep when reading chunk tree during mount Often some test cases like btrfs/161 trigger lockdep splats that complain about possible unsafe lock scenario due to the fact that during mount, when reading the chunk tree we end up calling blkdev_get_by_path() while holding a read lock on a leaf of the chunk tree. That produces a lockdep splat like the following: [ 3653.683975] ====================================================== [ 3653.685148] WARNING: possible circular locking dependency detected [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted [ 3653.687239] ------------------------------------------------------ [ 3653.688400] mount/447465 is trying to acquire lock: [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.691054] but task is already holding lock: [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.693978] which lock already depends on the new lock. [ 3653.695510] the existing dependency chain (in reverse order) is: [ 3653.696915] -> #3 (btrfs-chunk-00){++++}-{3:3}: [ 3653.698053] down_read_nested+0x4b/0x140 [ 3653.698893] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.699988] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 3653.701205] btrfs_search_slot+0x537/0xc00 [btrfs] [ 3653.702234] btrfs_insert_empty_items+0x32/0x70 [btrfs] [ 3653.703332] btrfs_init_new_device+0x563/0x15b0 [btrfs] [ 3653.704439] btrfs_ioctl+0x2110/0x3530 [btrfs] [ 3653.705405] __x64_sys_ioctl+0x83/0xb0 [ 3653.706215] do_syscall_64+0x3b/0xc0 [ 3653.706990] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.708040] -> #2 (sb_internal#2){.+.+}-{0:0}: [ 3653.708994] lock_release+0x13d/0x4a0 [ 3653.709533] up_write+0x18/0x160 [ 3653.710017] btrfs_sync_file+0x3f3/0x5b0 [btrfs] [ 3653.710699] __loop_update_dio+0xbd/0x170 [loop] [ 3653.711360] lo_ioctl+0x3b1/0x8a0 [loop] [ 3653.711929] block_ioctl+0x48/0x50 [ 3653.712442] __x64_sys_ioctl+0x83/0xb0 [ 3653.712991] do_syscall_64+0x3b/0xc0 [ 3653.713519] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.714233] -> #1 (&lo->lo_mutex){+.+.}-{3:3}: [ 3653.715026] __mutex_lock+0x92/0x900 [ 3653.715648] lo_open+0x28/0x60 [loop] [ 3653.716275] blkdev_get_whole+0x28/0x90 [ 3653.716867] blkdev_get_by_dev.part.0+0x142/0x320 [ 3653.717537] blkdev_open+0x5e/0xa0 [ 3653.718043] do_dentry_open+0x163/0x390 [ 3653.718604] path_openat+0x3f0/0xa80 [ 3653.719128] do_filp_open+0xa9/0x150 [ 3653.719652] do_sys_openat2+0x97/0x160 [ 3653.720197] __x64_sys_openat+0x54/0x90 [ 3653.720766] do_syscall_64+0x3b/0xc0 [ 3653.721285] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.721986] -> #0 (&disk->open_mutex){+.+.}-{3:3}: [ 3653.722775] __lock_acquire+0x130e/0x2210 [ 3653.723348] lock_acquire+0xd7/0x310 [ 3653.723867] __mutex_lock+0x92/0x900 [ 3653.724394] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.725041] blkdev_get_by_path+0xb8/0xd0 [ 3653.725614] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.726332] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.726999] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.727739] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.728384] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.729130] legacy_get_tree+0x30/0x50 [ 3653.729676] vfs_get_tree+0x28/0xc0 [ 3653.730192] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.730800] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.731427] legacy_get_tree+0x30/0x50 [ 3653.731970] vfs_get_tree+0x28/0xc0 [ 3653.732486] path_mount+0x2d4/0xbe0 [ 3653.732997] __x64_sys_mount+0x103/0x140 [ 3653.733560] do_syscall_64+0x3b/0xc0 [ 3653.734080] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.734782] other info that might help us debug this: [ 3653.735784] Chain exists of: &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00 [ 3653.737123] Possible unsafe locking scenario: [ 3653.737865] CPU0 CPU1 [ 3653.738435] ---- ---- [ 3653.739007] lock(btrfs-chunk-00); [ 3653.739449] lock(sb_internal#2); [ 3653.740193] lock(btrfs-chunk-00); [ 3653.740955] lock(&disk->open_mutex); [ 3653.741431] *** DEADLOCK *** [ 3653.742176] 3 locks held by mount/447465: [ 3653.742739] #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0 [ 3653.744114] #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs] [ 3653.745563] #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.747066] stack backtrace: [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1 [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 3653.750592] Call Trace: [ 3653.750967] dump_stack_lvl+0x57/0x72 [ 3653.751526] check_noncircular+0xf3/0x110 [ 3653.752136] ? stack_trace_save+0x4b/0x70 [ 3653.752748] __lock_acquire+0x130e/0x2210 [ 3653.753356] lock_acquire+0xd7/0x310 [ 3653.753898] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.754596] ? lock_is_held_type+0xe8/0x140 [ 3653.755125] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.755729] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.756338] __mutex_lock+0x92/0x900 [ 3653.756794] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.757400] ? do_raw_spin_unlock+0x4b/0xa0 [ 3653.757930] ? _raw_spin_unlock+0x29/0x40 [ 3653.758437] ? bd_prepare_to_claim+0x129/0x150 [ 3653.758999] ? trace_module_get+0x2b/0xd0 [ 3653.759508] ? try_module_get.part.0+0x50/0x80 [ 3653.760072] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.760661] ? devcgroup_check_permission+0xc1/0x1f0 [ 3653.761288] blkdev_get_by_path+0xb8/0xd0 [ 3653.761797] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.762454] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.763055] ? clone_fs_devices+0x8f/0x170 [btrfs] [ 3653.763689] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.764370] ? kvm_sched_clock_read+0x14/0x40 [ 3653.764922] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.765493] ? super_setup_bdi_name+0x79/0xd0 [ 3653.766043] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.766780] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.767488] ? kfree+0x1f2/0x3c0 [ 3653.767979] legacy_get_tree+0x30/0x50 [ 3653.768548] vfs_get_tree+0x28/0xc0 [ 3653.769076] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.769718] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.770381] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.771086] ? kfree+0x1f2/0x3c0 [ 3653.771574] legacy_get_tree+0x30/0x50 [ 3653.772136] vfs_get_tree+0x28/0xc0 [ 3653.772673] path_mount+0x2d4/0xbe0 [ 3653.773201] __x64_sys_mount+0x103/0x140 [ 3653.773793] do_syscall_64+0x3b/0xc0 [ 3653.774333] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.775094] RIP: 0033:0x7f648bc45aaa This happens because through btrfs_read_chunk_tree(), which is called only during mount, ends up acquiring the mutex open_mutex of a block device while holding a read lock on a leaf of the chunk tree while other paths need to acquire other locks before locking extent buffers of the chunk tree. Since at mount time when we call btrfs_read_chunk_tree() we know that we don't have other tasks running in parallel and modifying the chunk tree, we can simply skip locking of chunk tree extent buffers. So do that and move the assertion that checks the fs is not yet mounted to the top block of btrfs_read_chunk_tree(), with a comment before doing it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d03dbeb |
|
04-Oct-2021 |
Li Zhang <zhanglikernel@gmail.com> |
btrfs: clear MISSING device status bit in btrfs_close_one_device Reported bug: https://github.com/kdave/btrfs-progs/issues/389 There's a problem with scrub reporting aborted status but returning error code 0, on a filesystem with missing and readded device. Roughly these steps: - mkfs -d raid1 dev1 dev2 - fill with data - unmount - make dev1 disappear - mount -o degraded - copy more data - make dev1 appear again Running scrub afterwards reports that the command was aborted, but the system log message says the exit code was 0. It seems that the cause of the error is decrementing fs_devices->missing_devices but not clearing device->dev_state. Every time we umount filesystem, it would call close_ctree, And it would eventually involve btrfs_close_one_device to close the device, but it only decrements fs_devices->missing_devices but does not clear the device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer Overflow, because every time umount, fs_devices->missing_devices will decrease. If fs_devices->missing_devices value hit 0, it would overflow. With added debugging: loop1: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311) loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615 If fs_devices->missing_devices is 0, next time it would be 18446744073709551615 After apply this patch, the fs_devices->missing_devices seems to be right: $ truncate -s 10g test1 $ truncate -s 10g test2 $ losetup /dev/loop1 test1 $ losetup /dev/loop2 test2 $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f $ losetup -d /dev/loop2 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ dmesg loop1: detected capacity change from 0 to 20971520 loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863) BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): checking UUID tree BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54fde91f |
|
14-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: update device path inode time instead of bd_inode Christoph pointed out that I'm updating bdev->bd_inode for the device time when we remove block devices from a btrfs file system, however this isn't actually exposed to anything. The inode we want to update is the one that's associated with the path to the device, usually on devtmpfs, so that blkid notices the difference. We still don't want to do the blkdev_open, so use kern_path() to get the path to the given device and do the update time on that inode. Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing") Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2bb2e00e |
|
13-Oct-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock between chunk allocation and chunk btree modifications When a task is doing some modification to the chunk btree and it is not in the context of a chunk allocation or a chunk removal, it can deadlock with another task that is currently allocating a new data or metadata chunk. These contexts are the following: * When relocating a system chunk, when we need to COW the extent buffers that belong to the chunk btree; * When adding a new device (ioctl), where we need to add a new device item to the chunk btree; * When removing a device (ioctl), where we need to remove a device item from the chunk btree; * When resizing a device (ioctl), where we need to update a device item in the chunk btree and may need to relocate a system chunk that lies beyond the new device size when shrinking a device. The problem happens due to a sequence of steps like the following: 1) Task A starts a data or metadata chunk allocation and it locks the chunk mutex; 2) Task B is relocating a system chunk, and when it needs to COW an extent buffer of the chunk btree, it has locked both that extent buffer as well as its parent extent buffer; 3) Since there is not enough available system space, either because none of the existing system block groups have enough free space or because the only one with enough free space is in RO mode due to the relocation, task B triggers a new system chunk allocation. It blocks when trying to acquire the chunk mutex, currently held by task A; 4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert the new chunk item into the chunk btree and update the existing device items there. But in order to do that, it has to lock the extent buffer that task B locked at step 2, or its parent extent buffer, but task B is waiting on the chunk mutex, which is currently locked by task A, therefore resulting in a deadlock. One example report when the deadlock happens with system chunk relocation: INFO: task kworker/u9:5:546 blocked for more than 143 seconds. Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993 __down_read_common kernel/locking/rwsem.c:1214 [inline] __down_read kernel/locking/rwsem.c:1223 [inline] down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590 __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47 btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline] btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191 btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline] btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728 btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794 btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504 do_chunk_alloc fs/btrfs/block-group.c:3408 [inline] btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653 flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670 btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297 worker_thread+0x90/0xed0 kernel/workqueue.c:2444 kthread+0x3e5/0x4d0 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 INFO: task syz-executor:9107 blocked for more than 143 seconds. Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631 find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline] find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335 btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415 btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813 __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415 btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570 btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768 relocate_tree_block fs/btrfs/relocation.c:2694 [inline] relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757 relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673 btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070 btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181 __btrfs_balance fs/btrfs/volumes.c:3911 [inline] btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301 btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137 btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae So fix this by making sure that whenever we try to modify the chunk btree and we are neither in a chunk allocation context nor in a chunk remove context, we reserve system space before modifying the chunk btree. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/ Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array") CC: stable@vger.kernel.org # 5.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a15eb72 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls For device removal and replace we call btrfs_find_device_by_devspec, which if we give it a device path and nothing else will call btrfs_get_dev_args_from_path, which opens the block device and reads the super block and then looks up our device based on that. However at this point we're holding the sb write "lock", so reading the block device pulls in the dependency of ->open_mutex, which produces the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #405 Not tainted ------------------------------------------------------ losetup/11576 is trying to acquire lock: ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_get_by_dev.part.0+0x56/0x3c0 blkdev_get_by_path+0x98/0xa0 btrfs_get_bdev_and_sb+0x1b/0xb0 btrfs_find_device_by_devspec+0x12b/0x1c0 btrfs_rm_device+0x127/0x610 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11576: #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f31b02404cb Instead what we want to do is populate our device lookup args before we grab any locks, and then pass these args into btrfs_rm_device(). From there we can find the device and do the appropriate removal. Suggested-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
faa775c4 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a btrfs_get_dev_args_from_path helper We are going to want to populate our device lookup args outside of any locks and then do the actual device lookup later, so add a helper to do this work and make btrfs_find_device_by_devspec() use this helper for now. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
562d7b15 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle device lookup with btrfs_dev_lookup_args We have a lot of device lookup functions that all do something slightly different. Clean this up by adding a struct to hold the different lookup criteria, and then pass this around to btrfs_find_device() so it can do the proper matching based on the lookup criteria. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b41393f |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not call close_fs_devices in btrfs_rm_device There's a subtle case where if we're removing the seed device from a file system we need to free its private copy of the fs_devices. However we do not need to call close_fs_devices(), because at this point there are no devices left to close as we've closed the last one. The only thing that close_fs_devices() does is decrement ->opened, which should be 1. We want to avoid calling close_fs_devices() here because it has a lockdep_assert_held(&uuid_mutex), and we are going to stop holding the uuid_mutex in this path. So simply decrement the ->opened counter like we should, and then clean up like normal. Also add a comment explaining what we're doing here as I initially removed this code erroneously. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e906945 |
|
05-Oct-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use num_device to check for the last surviving seed device For both sprout and seed fsids, btrfs_fs_devices::num_devices provides device count including missing btrfs_fs_devices::open_devices provides device count excluding missing We create a dummy struct btrfs_device for the missing device, so num_devices != open_devices when there is a missing device. In btrfs_rm_devices() we wrongly check for %cur_devices->open_devices before freeing the seed fs_devices. Instead we should check for %cur_devices->num_devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a258d72 |
|
23-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove btrfs_raid_bio::fs_info member We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is always passed into alloc_rbio(), and only get released when the raid bio is released. Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info parameters for alloc_rbio() callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
731ccf15 |
|
23-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: make sure btrfs_io_context::fs_info is always initialized Currently btrfs_io_context::fs_info is only initialized in btrfs_map_bio, but there are call sites like btrfs_map_sblock() which calls __btrfs_map_block() directly, leaving bioc::fs_info uninitialized (NULL). Currently this is fine, but later cleanup will rely on bioc::fs_info to grab fs_info, and this can be a hidden problem for such usage. This patch will remove such hidden uninitialized member by always assigning bioc::fs_info at alloc_btrfs_io_context(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ef9dc0f |
|
27-Jul-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not take the uuid_mutex in btrfs_rm_device We got the following lockdep splat while running fstests (specifically btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which converted loop to using workqueues, which comes with lockdep annotations that don't exist with kworkers. The lockdep splat is as follows: WARNING: possible circular locking dependency detected 5.14.0-rc2-custom+ #34 Not tainted ------------------------------------------------------ losetup/156417 is trying to acquire lock: ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600 but task is already holding lock: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x28/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x163/0x3a0 path_openat+0x74d/0xa40 do_filp_open+0x9c/0x140 do_sys_openat2+0xb1/0x170 __x64_sys_openat+0x54/0x90 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 blkdev_get_by_dev.part.0+0xd1/0x3c0 blkdev_get_by_path+0xc0/0xd0 btrfs_scan_one_device+0x52/0x1f0 [btrfs] btrfs_control_ioctl+0xac/0x170 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (uuid_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 btrfs_rm_device+0x48/0x6a0 [btrfs] btrfs_ioctl+0x2d1c/0x3110 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#11){.+.+}-{0:0}: lo_write_bvec+0x112/0x290 [loop] loop_process_work+0x25f/0xcb0 [loop] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x266/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 flush_workqueue+0xae/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/156417: #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] stack backtrace: CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0x10a/0x120 __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 ? flush_workqueue+0x84/0x600 flush_workqueue+0xae/0x600 ? flush_workqueue+0x84/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] ? __lock_acquire+0x3a0/0x1dc0 ? update_dl_rq_load_avg+0x152/0x360 ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f645884de6b Usually the uuid_mutex exists to protect the fs_devices that map together all of the devices that match a specific uuid. In rm_device we're messing with the uuid of a device, so it makes sense to protect that here. However in doing that it pulls in a whole host of lockdep dependencies, as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus we end up with the dependency chain under the uuid_mutex being added under the normal sb write dependency chain, which causes problems with loop devices. We don't need the uuid mutex here however. If we call btrfs_scan_one_device() before we scratch the super block we will find the fs_devices and not find the device itself and return EBUSY because the fs_devices is open. If we call it after the scratch happens it will not appear to be a valid btrfs file system. We do not need to worry about other fs_devices modifying operations here because we're protected by the exclusive operations locking. So drop the uuid_mutex here in order to fix the lockdep splat. A more detailed explanation from the discussion: We are worried about rm and scan racing with each other, before this change we'll zero the device out under the UUID mutex so when scan does run it'll make sure that it can go through the whole device scan thing without rm messing with us. We aren't worried if the scratch happens first, because the result is we don't think this is a btrfs device and we bail out. The only case we are concerned with is we scratch _after_ scan is able to read the superblock and gets a seemingly valid super block, so lets consider this case. Scan will call device_list_add() with the device we're removing. We'll call find_fsid_with_metadata_uuid() and get our fs_devices for this UUID. At this point we lock the fs_devices->device_list_mutex. This is what protects us in this case, but we have two cases here. 1. We aren't to the device removal part of the RM. We found our device, and device name matches our path, we go down and we set total_devices to our super number of devices, which doesn't affect anything because we haven't done the remove yet. 2. We are past the device removal part, which is protected by the device_list_mutex. Scan doesn't find the device, it goes down and does the if (fs_devices->opened) return -EBUSY; check and we bail out. Nothing about this situation is ideal, but the lockdep splat is real, and the fix is safe, tho admittedly a bit scary looking. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more from the discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3a3b19b |
|
15-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename struct btrfs_io_bio to btrfs_bio Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c664611 |
|
15-Sep-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename btrfs_bio to btrfs_io_context The structure btrfs_bio is used by two different sites: - bio->bi_private for mirror based profiles For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records how many mirrors are still pending, and save the original endio function of the bio. - RAID56 code In that case, RAID56 only utilize the stripes info, and no long uses that to trace the pending mirrors. So btrfs_bio is not always bind to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cdccc03a |
|
23-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove stale comment about the btrfs_show_devname There were few lockdep warnings because btrfs_show_devname() was using device_list_mutex as recorded in the commits: 0ccd05285e7f ("btrfs: fix a possible umount deadlock") 779bf3fefa83 ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex") And finally, commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for device list traversal") removed the device_list_mutex from btrfs_show_devname for performance reasons. This patch removes a stale comment about the function btrfs_show_devname and device_list_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b7cb29e6 |
|
23-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: update latest_dev when we create a sprout device When we add a device to the seed filesystem (sprouting) it is a new filesystem (and fsid) on the device added. Update the latest_dev so that /proc/self/mounts shows the correct device. Example: $ btrfstune -S1 /dev/vg/seed $ mount /dev/vg/seed /btrfs mount: /btrfs: WARNING: device write-protected, mounted read-only. $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 $ btrfs dev add -f /dev/vg/new /btrfs Before: $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 After: $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d24fa5c1 |
|
23-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: convert latest_bdev type to btrfs_device and rename In preparation to fix a bug in btrfs_show_devname(). Convert fs_devices::latest_bdev type from struct block_device to struct btrfs_device and, rename the member to fs_devices::latest_dev. So that btrfs_show_devname() can use fs_devices::latest_dev::name. Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a09f23c3 |
|
23-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename and switch to bool btrfs_chunk_readonly btrfs_chunk_readonly() checks if the given chunk is writeable. It returns 1 for readonly, and 0 for writeable. So the return argument type bool shall suffice instead of the current type int. Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we check if the bg is writeable, and helps to keep the logic at the parent function simpler to understand. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9675ea8c |
|
17-Aug-2021 |
Su Yue <l@damenly.su> |
btrfs: update comment for fs_devices::seed_list in btrfs_rm_device Update it since commit 944d3f9fac61 ("btrfs: switch seed device to list api") did conversion from fs_devices::seed to fs_devices::seed_list. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f6f39f7a |
|
18-Aug-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk The user facing function used to allocate new chunks is btrfs_chunk_alloc, unfortunately there is yet another similar sounding function - btrfs_alloc_chunk. This creates confusion, especially since the latter function can be considered "private" in the sense that it implements the first stage of chunk creation and as such is called by btrfs_chunk_alloc. To avoid the awkwardness that comes with having similarly named but distinctly different in their purpose function rename btrfs_alloc_chunk to btrfs_create_chunk, given that the main purpose of this function is to orchestrate the whole process of allocating a chunk - reserving space into devices, deciding on characteristics of the stripe size and creating the in-memory structures. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1226dfff |
|
19-Oct-2021 |
Christoph Hellwig <hch@lst.de> |
btrfs: use sync_blockdev Use sync_blockdev instead of opencoding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20211019062530.2174626-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
cda00eba |
|
17-Oct-2021 |
Christoph Hellwig <hch@lst.de> |
btrfs: use bdev_nr_bytes instead of open coding it Use the proper helper to read the block device size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20211018101130.1838532-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
6b225baa |
|
08-Sep-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix mount failure due to past and transient device flush error When we get an error flushing one device, during a super block commit, we record the error in the device structure, in the field 'last_flush_error'. This is used to later check if we should error out the super block commit, depending on whether the number of flush errors is greater than or equals to the maximum tolerated device failures for a raid profile. However if we get a transient device flush error, unmount the filesystem and later try to mount it, we can fail the mount because we treat that past error as critical and consider the device is missing. Even if it's very likely that the error will happen again, as it's probably due to a hardware related problem, there may be cases where the error might not happen again. One example is during testing, and a test case like the new generic/648 from fstests always triggers this. The test cases generic/019 and generic/475 also trigger this scenario, but very sporadically. When this happens we get an error like this: $ mount /dev/sdc /mnt mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error. $ dmesg (...) [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices [12918.890853] BTRFS error (device sdc): open_ctree failed The failure happens because when btrfs_check_rw_degradable() is called at mount time, or at remount from RO to RW time, is sees a non zero value in a device's ->last_flush_error attribute, and therefore considers that the device is 'missing'. Fix this by setting a device's ->last_flush_error to zero when we close a device, making sure the error is not seen on the next mount attempt. We only need to track flush errors during the current mount, so that we never commit a super block if such errors happened. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1247069 |
|
30-Aug-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix lockdep warning while mounting sprout fs Following test case reproduces lockdep warning. Test case: $ mkfs.btrfs -f <dev1> $ btrfstune -S 1 <dev1> $ mount <dev1> <mnt> $ btrfs device add <dev2> <mnt> -f $ umount <mnt> $ mount <dev2> <mnt> $ umount <mnt> The warning claims a possible ABBA deadlock between the threads initiated by [#1] btrfs device add and [#0] the mount. [ 540.743122] WARNING: possible circular locking dependency detected [ 540.743129] 5.11.0-rc7+ #5 Not tainted [ 540.743135] ------------------------------------------------------ [ 540.743142] mount/2515 is trying to acquire lock: [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs] [ 540.743458] but task is already holding lock: [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743541] which lock already depends on the new lock. [ 540.743543] the existing dependency chain (in reverse order) is: [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}: [ 540.743566] down_read_nested+0x48/0x2b0 [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs] [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs] [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs] [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs] [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs] [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs] [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs] [ 540.744166] __x64_sys_ioctl+0xe4/0x140 [ 540.744170] do_syscall_64+0x4b/0x80 [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}: [ 540.744184] __lock_acquire+0x155f/0x2360 [ 540.744188] lock_acquire+0x10b/0x5c0 [ 540.744190] __mutex_lock+0xb1/0xf80 [ 540.744193] mutex_lock_nested+0x27/0x30 [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs] [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs] [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs] [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs] [ 540.744472] legacy_get_tree+0x38/0x90 [ 540.744475] vfs_get_tree+0x30/0x140 [ 540.744478] fc_mount+0x16/0x60 [ 540.744482] vfs_kern_mount+0x91/0x100 [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs] [ 540.744536] legacy_get_tree+0x38/0x90 [ 540.744537] vfs_get_tree+0x30/0x140 [ 540.744539] path_mount+0x8d8/0x1070 [ 540.744541] do_mount+0x8d/0xc0 [ 540.744543] __x64_sys_mount+0x125/0x160 [ 540.744545] do_syscall_64+0x4b/0x80 [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744551] other info that might help us debug this: [ 540.744552] Possible unsafe locking scenario: [ 540.744553] CPU0 CPU1 [ 540.744554] ---- ---- [ 540.744555] lock(btrfs-chunk-00); [ 540.744557] lock(&fs_devs->device_list_mutex); [ 540.744560] lock(btrfs-chunk-00); [ 540.744562] lock(&fs_devs->device_list_mutex); [ 540.744564] *** DEADLOCK *** [ 540.744565] 3 locks held by mount/2515: [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450 [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs] [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.744708] stack backtrace: [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5 But the device_list_mutex in clone_fs_devices() is redundant, as explained below. Two threads [1] and [2] (below) could lead to clone_fs_device(). [1] open_ctree <== mount sprout fs btrfs_read_chunk_tree() mutex_lock(&uuid_mutex) <== global lock read_one_dev() open_seed_devices() clone_fs_devices() <== seed fs_devices mutex_lock(&orig->device_list_mutex) <== seed fs_devices [2] btrfs_init_new_device() <== sprouting mutex_lock(&uuid_mutex); <== global lock btrfs_prepare_sprout() lockdep_assert_held(&uuid_mutex) clone_fs_devices(seed_fs_device) <== seed fs_devices Both of these threads hold uuid_mutex which is sufficient to protect getting the seed device(s) freed while we are trying to clone it for sprouting [2] or mounting a sprout [1] (as above). A mounted seed device can not free/write/replace because it is read-only. An unmounted seed device can be freed by btrfs_free_stale_devices(), but it needs uuid_mutex. So this patch removes the unnecessary device_list_mutex in clone_fs_devices(). And adds a lockdep_assert_held(&uuid_mutex) in clone_fs_devices(). Reported-by: Su Yue <l@damenly.su> Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3fa421de |
|
27-Jul-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: delay blkdev_put until after the device remove When removing the device we call blkdev_put() on the device once we've removed it, and because we have an EXCL open we need to take the ->open_mutex on the block device to clean it up. Unfortunately during device remove we are holding the sb writers lock, which results in the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #407 Not tainted ------------------------------------------------------ losetup/11595 is trying to acquire lock: ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_put+0x3a/0x220 btrfs_rm_device.cold+0x62/0xe5 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11595: #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc21255d4cb So instead save the bdev and do the put once we've dropped the sb writers lock in order to avoid the lockdep recursion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8f96a5bf |
|
27-Jul-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: update the bdev time directly when closing We update the ctime/mtime of a block device when we remove it so that blkid knows the device changed. However we do this by re-opening the block device and calling filp_update_time. This is more correct because it'll call the inode->i_op->update_time if it exists, but the block dev inodes do not do this. Instead call generic_update_time() on the bd_inode in order to avoid the blkdev_open path and get rid of the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #406 Not tainted ------------------------------------------------------ losetup/11596 is trying to acquire lock: ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_get_by_dev.part.0+0x56/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 file_open_name+0xc7/0x170 filp_open+0x2c/0x50 btrfs_scratch_superblocks.part.0+0x10f/0x170 btrfs_rm_device.cold+0xe8/0xed btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11596: #0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d977e0e |
|
20-Aug-2021 |
Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> |
btrfs: reset replace target device to allocation state on close This crash was observed with a failed assertion on device close: BTRFS: Transaction aborted (error -28) WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs] RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388 R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0 Call Trace: flush_space+0x197/0x2f0 [btrfs] btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs] process_one_work+0x262/0x5e0 worker_thread+0x4c/0x320 ? process_one_work+0x5e0/0x5e0 kthread+0x144/0x170 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 irq event stamp: 19334989 hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400 hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400 softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574 softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140 ---[ end trace 45939e308e0dd3c7 ]--- BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left BTRFS info (device vdd): forced readonly BTRFS warning (device vdd): failed setting block group ro: -30 BTRFS info (device vdd): suspending dev_replace for unmount assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246 RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003 FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0 Call Trace: btrfs_close_one_device.cold+0x11/0x55 [btrfs] close_fs_devices+0x44/0xb0 [btrfs] btrfs_close_devices+0x48/0x160 [btrfs] generic_shutdown_super+0x69/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x2c/0xa0 cleanup_mnt+0x144/0x1b0 task_work_run+0x59/0xa0 exit_to_user_mode_loop+0xe7/0xf0 exit_to_user_mode_prepare+0xaf/0xf0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens when close_ctree is called while a dev_replace hasn't completed. In close_ctree, we suspend the dev_replace, but keep the replace target around so that we can resume the dev_replace procedure when we mount the root again. This is the call trace: close_ctree(): btrfs_dev_replace_suspend_for_unmount(); btrfs_close_devices(): btrfs_close_fs_devices(): btrfs_close_one_device(): ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)); However, since the replace target sticks around, there is a device with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the assertion in btrfs_close_one_device. To fix this, if we come across the replace target device when closing, we should properly reset it back to allocation state. This fix also ensures that if a non-target device has a corrupted state and has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still catch the error. Reported-by: David Sterba <dsterba@suse.com> Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4571b8c |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix NULL pointer dereference when deleting device by invalid id [BUG] It's easy to trigger NULL pointer dereference, just by removing a non-existing device id: # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \ /dev/test/scratch2 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device remove 3 /mnt/btrfs Then we have the following kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs] btrfs_ioctl+0x18bb/0x3190 [btrfs] ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 ? do_user_addr_fault+0x201/0x6a0 ? lock_release+0xd2/0x2d0 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae [CAUSE] Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") moves the "missing" device path check into btrfs_rm_device(). But btrfs_rm_device() itself can have case where it only receives @devid, with NULL as @device_path. In that case, calling strcmp() on NULL will trigger the NULL pointer dereference. Before that commit, we handle the "missing" case inside btrfs_find_device_by_devspec(), which will not check @device_path at all if @devid is provided, thus no way to trigger the bug. [FIX] Before calling strcmp(), also make sure @device_path is not NULL. Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") CC: stable@vger.kernel.org # 5.4+ Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ff40a91 |
|
29-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efc222f8 |
|
27-Jul-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify return values in btrfs_check_raid_min_devices Function btrfs_check_raid_min_devices() returns error code from the enum btrfs_err_code and it starts from 1. So there is no need to check if ret is > 0. So drop this check and also drop the local variable ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2f78e88 |
|
22-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: allow degenerate raid0/raid10 The data on raid0 and raid10 are supposed to be spread over multiple devices, so the minimum constraints are set to 2 and 4 respectively. This is an artificial limit and there's some interest to remove it. Change this to allow raid0 on one device and raid10 on two devices. This works as expected eg. when converting or removing devices. The only difference is when raid0 on two devices gets one device removed. Unpatched would silently create a single profile, while newly it would be raid0. The motivation is to allow to preserve the profile type as long as it possible for some intermediate state (device removal, conversion), or when there are disks of different size, with raid0 the otherwise unusable space of the last device will be used too. Similarly for raid10, though the two largest devices would need to be the same. Unpatched kernel will mount and use the degenerate profiles just fine but won't allow any operation that would not satisfy the stricter device number constraints, eg. not allowing to go from 3 to 2 devices for raid10 or various profile conversions. Example output: # btrfs fi us -T . Overall: Device size: 10.00GiB Device allocated: 1.01GiB Device unallocated: 8.99GiB Device missing: 0.00B Used: 200.61MiB Free (estimated): 9.79GiB (min: 9.79GiB) Free (statfs, df): 9.79GiB Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 3.25MiB (used: 0.00B) Multiple profiles: no Data Metadata System Id Path RAID0 single single Unallocated -- ---------- --------- --------- -------- ----------- 1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB -- ---------- --------- --------- -------- ----------- Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB Used 200.25MiB 352.00KiB 16.00KiB # btrfs dev us . /dev/sda10, ID: 1 Device size: 10.00GiB Device slack: 0.00B Data,RAID0/1: 1.00GiB Metadata,single: 8.00MiB System,single: 1.00MiB Unallocated: 8.99GiB Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per profile is printed. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8050b3b |
|
26-Jul-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: subpage: reject raid56 filesystem and profile conversion RAID56 is not only unsafe due to its write-hole problem, but also has tons of hardcoded PAGE_SIZE. Disable it for subpage support for now. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
214cc184 |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: constify and cleanup variables in comparators Comparators just read the data and thus get const parameters. This should be also preserved by the local variables, update all comparators passed to sort or bsearch. Cleanups: - unnecessary casts are dropped - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern and 'inline' is dropped as the function address is taken Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d58ede8d |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: simplify data stripe calculation helpers There are two helpers doing the same calculations based on nparity and ncopies. calc_data_stripes can be simplified into one expression, so far we don't have profile with both copies and parity, so there's no effective change. calc_stripe_length should reuse the helper and not repeat the same calculation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fe4f46d4 |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: merge alloc_device helpers The device allocation is split to two functions, but one just calls the other and they're very far in the file. Merge them together. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
500a44c9 |
|
26-Jul-2021 |
David Sterba <dsterba@suse.com> |
btrfs: uninline btrfs_bg_flags_to_raid_index The helper does a simple translation from block group flags to index to the btrfs_raid_array table. There's no apparent reason to inline the function, the translation happens usually once per function and is not called in a loop. Making it a proper function saves quite some binary code (x86_64, release config): text data bss dec hex filename 1164011 19253 14912 1198176 124860 pre/btrfs.ko 1161559 19253 14912 1195724 123ecc post/btrfs.ko DELTA: -2451 Also add the const attribute as there are no side effects, this could help compiler to optimize a few things without the function body. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad9a9378 |
|
13-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems After calling btrfs_search_slot is a common practice to check if the slot found isn't bigger than number of slots in the current leaf, and if so, search for the same key in the next leaf by calling btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job. Calling btrfs_next_item in the same situation would end up in the same code flow, since * btrfs_next_item * btrfs_next_old_item * if slot >= nritems(curr_leaf) btrfs_next_old_leaf Change btrfs_verify_dev_extents and calculate_emulated_zone_size functions to use btrfs_next_leaf in the same situation. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2eadb9e7 |
|
04-Jul-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_finish_chunk_alloc private to block-group.c One of the final things that must be done to add a new chunk is inserting its device extent items in the device tree. They describe the portion of allocated device physical space during phase 1 of chunk allocation. This is currently done in btrfs_finish_chunk_alloc whose name isn't very informative. What's more, this function is only used in block-group.c but is defined as public. There isn't anything special about it that would warrant it being defined in volumes.c. Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to block-group.c, make the former static and rename both functions to insert_dev_extents and insert_dev_extent respectively. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2a61667 |
|
27-Jul-2021 |
Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> |
btrfs: fix rw device counting in __btrfs_free_extra_devids When removing a writeable device in __btrfs_free_extra_devids, the rw device count should be decremented. This error was caught by Syzbot which reported a warning in close_fs_devices: WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168 Modules linked in: CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168 RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293 RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128 R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000 R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180 open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693 btrfs_fill_super fs/btrfs/super.c:1382 [inline] btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1498 fc_mount fs/namespace.c:993 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023 btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0x196f/0x2be0 fs/namespace.c:3235 do_mount fs/namespace.c:3248 [inline] __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433 do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae Because fs_devices->rw_devices was not 0 after closing all devices. Here is the call trace that was observed: btrfs_mount_root(): btrfs_scan_one_device(): device_list_add(); <---------------- device added btrfs_open_devices(): open_fs_devices(): btrfs_open_one_device(); <-------- writable device opened, rw device count ++ btrfs_fill_super(): open_ctree(): btrfs_free_extra_devids(): __btrfs_free_extra_devids(); <--- writable device removed, rw device count not decremented fail_tree_roots: btrfs_close_devices(): close_fs_devices(); <------- rw device count off by 1 As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device"), rw_devices was decremented on removing a writable device in __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit was not set for the device. However, this check does not need to be reinstated as it is now redundant and incorrect. In __btrfs_free_extra_devids, we skip removing the device if it is the target for replacement. This is done by checking whether device->devid == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check, and so it's redundant to test for that bit. Additionally, following commit 82372bc816d7 ("Btrfs: make the logic of source device removing more clear"), rw_devices is incremented whenever a writeable device is added to the alloc list (including the target device in btrfs_dev_replace_finishing), so all removals of writable devices from the alloc list should also be accompanied by a decrement to rw_devices. Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device") CC: stable@vger.kernel.org # 5.10+ Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79bd3712 |
|
29-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a8698707 |
|
25-May-2021 |
Christoph Hellwig <hch@lst.de> |
block: move bd_mutex to struct gendisk Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
1cea5cf0 |
|
21-Jun-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure relocation never runs while we have send operations running Relocation and send do not play well together because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but it did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion which shrinks a device's size to 0 before deleting the device). We also have now one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send to block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a9fd417 |
|
21-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: fix typos in comments Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43c0d1a5 |
|
13-Apr-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() The parameter @len is not really used in btrfs_bio_fits_in_stripe(), just remove it. It got removed in 420343131970 ("btrfs: let callers of btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map utilized it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
94358c35 |
|
25-Apr-2021 |
Su Yue <l@damenly.su> |
btrfs: remove stale comment for argument seed of btrfs_find_device Commit b2598edf8b36 ("btrfs: remove unused argument seed from btrfs_find_device") removed the argument seed from btrfs_find_device but forgot the comment, so remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d6f67afb |
|
10-May-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error Commit 7000babddac6 ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned") assigned false to the hole_start parameter of dev_extent_hole_check_zoned(). The hole_start parameter is not boolean and returns the start location of the found hole. Fixes: 7000babddac6 ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4f0f586b |
|
08-Apr-2021 |
Sami Tolvanen <samitolvanen@google.com> |
treewide: Change list_sort to use const pointers list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
|
#
18bb8bbf |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: automatically reclaim zones When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3372065 |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
01e86008 |
|
19-Apr-2021 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: reset zones of relocated block groups When relocating a block group the freed up space is not discarded in one big block, but each extent is discarded on its own with -odisard=sync. For a zoned filesystem we need to discard the whole block group at once, so btrfs_discard_extent() will translate the discard into a REQ_OP_ZONE_RESET operation, which then resets the device's zone. Failure to reset the zone is not fatal error. Discussion about the approach and regarding transaction blocking: https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9306ad4 |
|
24-Feb-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: more graceful errors/warnings on 32bit systems when reaching limits Btrfs uses internally mapped u64 address space for all its metadata. Due to the page cache limit on 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K page size while 256T for 64K page size. Users can have a filesystem which doesn't have metadata beyond the boundary at mount time, but later balance can cause it to create metadata beyond the boundary. And modification to MM layer is unrealistic just for such minor use case. We can't do more than to prevent mounting such filesystem or warn early when the numbers are still within the limits. To address such problem, this patch will introduce the following checks: - Mount time rejection This will reject any fs which has metadata chunk at or beyond the boundary. - Mount time early warning If there is any metadata chunk beyond 5/8th of the boundary, we do an early warning and hope the end user will see it. - Runtime extent buffer rejection If we're going to allocate an extent buffer at or beyond the boundary, reject such request with EOVERFLOW. This is definitely going to cause problems like transaction abort, but we have no better ways. - Runtime extent buffer early warning If an extent buffer beyond 5/8th of the max file size is allocated, do an early warning. Above error/warning message will only be printed once for each fs to reduce dmesg flood. If the mount is rejected, the filesystem will be mountable only on a 64bit host. Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/ Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb05b298 |
|
23-Mar-2021 |
Arnd Bergmann <arnd@arndb.de> |
btrfs: zoned: bail out in btrfs_alloc_chunk for bad input gcc complains that the ctl->max_chunk_size member might be used uninitialized when none of the three conditions for initializing it in init_alloc_chunk_ctl_policy_zoned() are true: In function ‘init_alloc_chunk_ctl_policy_zoned’, inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3, inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2: include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized] 4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size); | ^~~ fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’: fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here 5316 | struct alloc_chunk_ctl ctl; | ^~~ If we ever get into this condition, something is seriously wrong, as validity is checked in the callers btrfs_alloc_chunk init_alloc_chunk_ctl init_alloc_chunk_ctl_policy_zoned so the same logic as in init_alloc_chunk_ctl_policy_regular() and a few other places should be applied. This avoids both further data corruption, and the compile-time warning. Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7000babd |
|
03-Mar-2021 |
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> |
btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned Fix the following coccicheck warnings: ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function 'dev_extent_hole_check_zoned' with return type bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82d62d06 |
|
11-Mar-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not initialize dev stats if we have no dev_root Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1 RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0 RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10 RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000 RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000 R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00 R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10 FS: 00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 btrfs_init_dev_stats+0x62/0xf0 open_ctree+0x1019/0x15ff btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f698c04e52e This happens because we unconditionally attempt to initialize device stats on mount, but we may not have been able to read the device root. Fix this by skipping initializing the device stats if we do not have a device root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7ef5287 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems When a bad checksum is found and if the filesystem has a mirror of the damaged data, we read the correct data from the mirror and writes it to damaged blocks. This however, violates the sequential write constraints of a zoned block device. We can consider three methods to repair an IO failure in zoned filesystems: (1) Reset and rewrite the damaged zone (2) Allocate new device extent and replace the damaged device extent to the new extent (3) Relocate the corresponding block group Method (1) is most similar to a behavior done with regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessary degrades non-damaged data. Method (2) is much like device replacing but done in the same device. It is safe because it keeps the device extent until the replacing finish. However, extending device replacing is non-trivial. It assumes "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing function should be extended to support replacing device extent position in one device. Method (3) invokes relocation of the damaged block group and is straightforward to implement. It relocates all the mirrored device extents, so it potentially is a more costly operation than method (1) or (2). But it relocates only used extents which reduce the total IO size. Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2). For protecting a block group gets relocated multiple time with multiple IO errors, this commit introduces "relocating_repair" bit to show it's now relocating to repair IO failures. Also it uses a new kthread "btrfs-relocating-repair", not to block IO path with relocating process. This commit also supports repairing in the scrub process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de17addc |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement copying for zoned device-replace This is 3/4 patch to implement device-replace on zoned filesystems. This commit implements copying. To do this, it tracks the write pointer during the device replace process. As device-replace's copy process is smart enough to only copy used extents on the source device, we have to fill the gap to honor the sequential write requirement in the target device. The device-replace process on zoned filesystems must copy or clone all the extents in the source device exactly once. So, we need to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the removed code in the commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace"). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6143c23c |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement cloning for zoned device-replace This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8e3fb10 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: use ZONE_APPEND write for zoned mode Enable zone append writing for zoned mode. When using zone append, a bio is issued to the start of a target zone and the device decides to place it inside the zone. Upon completion the device reports the actual written position back to the host. Three parts are necessary to enable zone append mode. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point the beginning of the zone. Second, record the returned physical address (and disk/partno) to the ordered extent in end_bio_extent_writepage() after the bio has been completed. We cannot resolve the physical address to the logical address because we can neither take locks nor allocate a buffer in this end_bio context. So, we need to record the physical address to resolve it later in btrfs_finish_ordered_io(). And finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address using btrfs_rmap_block. If the returned address matches the originally allocated address, we can skip this rewriting process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cfe94440 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual devices. Let btrfs_end_bio() and btrfs_op be aware of it, by mapping REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of bio_op(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
381a696e |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: verify device extent is aligned to zone Add a check in verify_one_dev_extent() to ensure that a device extent on a zoned block device is aligned to the respective zone boundary. If it isn't, mark the filesystem as unclean. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cd6121f |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: implement zoned chunk allocator Implement a zoned chunk and device extent allocator. One device zone becomes a device extent so that a zone reset affects only this device extent and does not change the state of blocks in the neighbor device extents. To implement the allocator, we need to extend the following functions for a zoned filesystem. - init_alloc_chunk_ctl - dev_extent_search_start - dev_extent_hole_check - decide_stripe_size init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always set the stripe_size to the zone size and aligns the parameters to the zone size. dev_extent_search_start() only aligns the start offset to zone boundaries. We don't care about the first 1MB like in regular filesystem because we anyway reserve the first two zones for superblock logging. dev_extent_hole_check_zoned() checks if zones in given hole are either conventional or empty sequential zones. Also, it skips zones reserved for superblock logging. With the change to the hole, the new hole may now contain pending extents. So, in this case, loop again to check that. Finally, decide_stripe_size_zoned() should shrink the number of devices instead of stripe size because we need to honor stripe_size == zone_size. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73651042 |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: defer loading zone info after opening trees This is a preparation patch to implement zone emulation on a regular device. To emulate a zoned filesystem on a regular (non-zoned) device, we need to decide an emulated zone size. Instead of making it a compile-time static value, we'll make it configurable at mkfs time. Since we have one zone == one device extent restriction, we can determine the emulated zone size from the size of a device extent. We can extend btrfs_get_dev_zone_info() to show a regular device filled with conventional zones once the zone size is decided. The current call site of btrfs_get_dev_zone_info() during the mount process is earlier than loading the file system trees so that we don't know the size of a device extent at this point. Thus we can't slice a regular device to conventional zones. This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone info for all the devices. And, it places this function in open_ctree() after loading the trees. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42034313 |
|
27-Jan-2021 |
Michal Rostecki <mrostecki@suse.com> |
btrfs: let callers of btrfs_get_io_geometry pass the em Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution. That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function. It was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function. Change that to passing the extent mapping to the btrfs_get_io_geometry() function as an argument. This could improve performance in some cases. For very large filesystems, i.e. several thousands of allocated chunks, not only this avoids searching two times the rbtree, saving time, it may also help reducing contention on the lock that protects the tree - thinking of writeback starting for multiple inodes, other tasks allocating or removing chunks, and anything else that requires access to the rbtree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Michal Rostecki <mrostecki@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add Filipe's analysis ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7056bf69 |
|
17-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device Instead of having three 'if' to handle non-NULL return value consolidate this in one 'if (ret)'. That way the code is more obvious: - Always drop delete_unused_bgs_mutex if ret is not NULL - If ret is negative -> goto done - If it's 1 -> reset ret to 0, release the path and finish the loop. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
616c6a68 |
|
26-Jan-2021 |
Christoph Hellwig <hch@lst.de> |
btrfs: use bio_kmalloc in __alloc_device Use bio_kmalloc instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
c41ec452 |
|
21-Jan-2021 |
Su Yue <l@damenly.su> |
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch This effectively reverts commit d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t"). While running fstests on 32 bits test box, many tests failed because of warnings in dmesg. One of those warnings (btrfs/003): [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs] [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5 [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs] [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803 [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70 [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246 [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0 [66.441485] Call Trace: [66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs] [66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs] [66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs] [66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs] [66.441619] ? __this_cpu_preempt_check+0xf/0x11 [66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs] [66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs] [66.441700] ? __lock_acquire+0x35f/0x2650 [66.441717] ? lockdep_hardirqs_on+0x87/0x120 [66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0 [66.441724] ? call_rcu+0x2d3/0x530 [66.441731] ? __might_fault+0x41/0x90 [66.441736] ? kvm_sched_clock_read+0x15/0x50 [66.441740] ? sched_clock+0x8/0x10 [66.441745] ? sched_clock_cpu+0x13/0x180 [66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [66.441768] __ia32_sys_ioctl+0x165/0x8a0 [66.441773] ? __this_cpu_preempt_check+0xf/0x11 [66.441785] ? __might_fault+0x89/0x90 [66.441791] __do_fast_syscall_32+0x54/0x80 [66.441796] do_fast_syscall_32+0x32/0x70 [66.441801] do_SYSENTER_32+0x15/0x20 [66.441805] entry_SYSENTER_32+0x9f/0xf2 [66.441808] EIP: 0xab7b5549 [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98 [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292 [66.441833] irq event stamp: 42579 [66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590 [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590 [66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60 [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60 ======================================================================== btrfs_remove_chunk+0x58b/0x7b0: __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279 (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212 (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994 ======================================================================== The warning is produced by lockdep_assert_held() in __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled. And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock fs_info->chunk_mutex held already. After adding some debug prints, the cause was found that many __alloc_device() are called with NULL @fs_info (during scanning ioctl). Inside the function, btrfs_device_data_ordered_init() is expanded to seqcount_mutex_init(). In this scenario, its second parameter info->chunk_mutex is &NULL->chunk_mutex which equals to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus, seqcount_mutex_init() is called in wrong way. And later btrfs_device_get/set helpers trigger lockdep warnings. The device and filesystem object lifetimes are different and we'd have to synchronize initialization of the btrfs_device::data_seqcount with the fs_info, possibly using some additional synchronization. It would still not prevent concurrent access to the seqcount lock when it's used for read and initialization. Commit d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t") does not mention a particular problem being fixed so revert should not cause any harm and we'll get the lockdep warning fixed. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139 Reported-by: Erhard F <erhard_f@mailbox.org> Fixes: d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t") CC: stable@vger.kernel.org # 5.10 CC: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb286100 |
|
16-Dec-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix lockdep splat in btrfs_recover_relocation While testing the error paths of relocation I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc6+ #217 Not tainted ------------------------------------------------------ mount/779 is trying to acquire lock: ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340 but task is already holding lock: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x462/0x8f0 btrfs_update_root+0x55/0x2b0 btrfs_drop_snapshot+0x398/0x750 clean_dirty_subvols+0xdf/0x120 btrfs_recover_relocation+0x534/0x5a0 btrfs_start_pre_rw_mount+0xcb/0x170 open_ctree+0x151f/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (sb_internal#2){.+.+}-{0:0}: start_transaction+0x444/0x700 insert_balance_item.isra.0+0x37/0x320 btrfs_balance+0x354/0xf40 btrfs_ioctl_balance+0x2cf/0x380 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}: __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 __mutex_lock+0x7e/0x7b0 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(sb_internal#2); lock(btrfs-root-00); lock(&fs_info->balance_mutex); *** DEADLOCK *** 2 locks held by mount/779: #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 stack backtrace: CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 ? btrfs_recover_balance+0x2f0/0x340 __mutex_lock+0x7e/0x7b0 ? btrfs_recover_balance+0x2f0/0x340 ? btrfs_recover_balance+0x2f0/0x340 ? rcu_read_lock_sched_held+0x3f/0x80 ? kmem_cache_alloc_trace+0x2c4/0x2f0 ? btrfs_get_64+0x5e/0x100 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea ? rcu_read_lock_sched_held+0x3f/0x80 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? __kmalloc_track_caller+0x2f2/0x320 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 ? capable+0x3a/0x60 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is straightforward to fix, simply release the path before we setup the balance_ctl. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0a1db70 |
|
14-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between RO remount and the cleaner task When we are remounting a filesystem in RO mode we can race with the cleaner task and result in leaking a transaction if the filesystem is unmounted shortly after, before the transaction kthread had a chance to commit that transaction. That also results in a crash during unmount, due to a use-after-free, if hardware acceleration is not available for crc32c. The following sequence of steps explains how the race happens. 1) The filesystem is mounted in RW mode and the cleaner task is running. This means that currently BTRFS_FS_CLEANER_RUNNING is set at fs_info->flags; 2) The cleaner task is currently running delayed iputs for example; 3) A filesystem RO remount operation starts; 4) The RO remount task calls btrfs_commit_super(), which commits any currently open transaction, and it finishes; 5) At this point the cleaner task is still running and it creates a new transaction by doing one of the following things: * When running the delayed iput() for an inode with a 0 link count, in which case at btrfs_evict_inode() we start a transaction through the call to evict_refill_and_join(), use it and then release its handle through btrfs_end_transaction(); * When deleting a dead root through btrfs_clean_one_deleted_snapshot(), a transaction is started at btrfs_drop_snapshot() and then its handle is released through a call to btrfs_end_transaction_throttle(); * When the remount task was still running, and before the remount task called btrfs_delete_unused_bgs(), the cleaner task also called btrfs_delete_unused_bgs() and it picked and removed one block group from the list of unused block groups. Before the cleaner task started a transaction, through btrfs_start_trans_remove_block_group() at btrfs_delete_unused_bgs(), the remount task had already called btrfs_commit_super(); 6) So at this point the filesystem is in RO mode and we have an open transaction that was started by the cleaner task; 7) Shortly after a filesystem unmount operation starts. At close_ctree() we stop the transaction kthread before it had a chance to commit the transaction, since less than 30 seconds (the default commit interval) have elapsed since the last transaction was committed; 8) We end up calling iput() against the btree inode at close_ctree() while there is an open transaction, and since that transaction was used to update btrees by the cleaner, we have dirty pages in the btree inode due to COW operations on metadata extents, and therefore writeback is triggered for the btree inode. So btree_write_cache_pages() is invoked to flush those dirty pages during the final iput() on the btree inode. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 So fix this by making the remount path to wait for the cleaner task before calling btrfs_commit_super(). The remount path now waits for the bit BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling btrfs_commit_super() and this ensures the cleaner can not start a transaction after that, because it sleeps when the filesystem is in RO mode and we have already flagged the filesystem as RO before waiting for BTRFS_FS_CLEANER_RUNNING to be cleared. This also introduces a new flag BTRFS_FS_STATE_RO to be used for fs_info->fs_state when the filesystem is in RO mode. This is because we were doing the RO check using the flags of the superblock and setting the RO mode simply by ORing into the superblock's flags - those operations are not atomic and could result in the cleaner not seeing the update from the remount task after it clears BTRFS_FS_CLEANER_RUNNING. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1201b58b |
|
26-Nov-2020 |
David Sterba <dsterba@suse.com> |
btrfs: drop casts of bio bi_sector Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the result of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12659251 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: implement log-structured superblock for ZONED mode Superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place superblock in the zone. One easy solution is limiting superblock and copies to be placed only in conventional zones. However, this method has two downsides: one is reduced number of superblock copies. The location of the second copy of superblock is 256GB, which is in a sequential write required zone on typical devices in the market today. So, the number of superblock and copies is limited to be two. Second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second one. Then, when both zones are filled up and before starting to write to the first zone again, it reset the first zone. We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it If these reserved zones are conventional, superblock is written fixed at the start of the zone without logging. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b70f5097 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: check and enable ZONED mode Introduce function btrfs_check_zoned_mode() to check if ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5b316468 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: get zone information of zoned block devices If a zoned block device is found, get its zone information (number of zones and zone size). To avoid costly run-time zone report commands to test the device zones type during block allocation, attach the seq_zones bitmap to the device structure to indicate if a zone is sequential or accept random writes. Also it attaches the empty_zones bitmap to indicate if a zone is empty or not. This patch also introduces the helper function btrfs_dev_is_sequential() to test if the zone storing a block is a sequential write required zone and btrfs_dev_is_empty_zone() to test if the zone is a empty zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2598edf |
|
02-Nov-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove unused argument seed from btrfs_find_device Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to check if the parameter seed is true in the function btrfs_find_device(). This tells it whether to traverse the seed device list or not. After this commit, the argument is unused and can be removed. In device_list_add() it's not necessary because fs_devices always points to the device's fs_devices. So with the devid+uuid matching, it will find the right device and return, thus not needing to traverse seed devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3a160a93 |
|
02-Nov-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop never met disk total bytes check in verify_one_dev_extent Drop the condition in verify_one_dev_extent, btrfs_device::disk_total_bytes is set even for a seed device. The comment is wrong, the size is properly set when cloning the device. Commit 1b3922a8bc74 ("btrfs: Use real device structure to verify dev extent") introduced it but it's unclear why the total_disk_bytes was 0. Theoretically, all devices (including missing and seed) marked with the BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at fill_device_from_item(): open_ctree() btrfs_read_chunk_tree() read_one_dev() open_seed_device() fill_device_from_item() Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its a bug to be fixed somewhere else and not in verify_one_dev_extent() as it's just a messenger. It is never expected that a total_disk_bytes shall be zero. The function fill_device_from_item() does the job of reading it from the item and updating btrfs_device::disk_total_bytes. So both the missing device and the seed devices do have their disk_total_bytes updated. btrfs_find_device can also return a device from fs_info->seed_list because it searches it as well. Furthermore, while removing the device if there is a power loss, we could have a device with its total_bytes = 0, that's still valid. Instead, introduce a check against maximum block device size in read_one_dev(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bacce86a |
|
06-Nov-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop unused argument step from btrfs_free_extra_devids Commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device") dropped the multi stage operation of btrfs_free_extra_devids() that does not need to check replace target anymore and we can remove the 'step' argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e114c545 |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: set the lockdep class for extent buffers on creation Both Filipe and Fedora QA recently hit the following lockdep splat: WARNING: possible recursive locking detected 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted -------------------------------------------- rsync/2610 is trying to acquire lock: ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 but task is already holding lock: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&eb->lock); lock(&eb->lock); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by rsync/2610: #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190 #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 stack backtrace: CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x12d/0x2a4 ? kvm_sched_clock_read+0x14/0x30 ? sched_clock+0x5/0x10 lock_acquire+0xc8/0x400 ? btrfs_tree_read_lock_atomic+0x34/0x140 ? read_block_for_search.isra.0+0xdd/0x320 _raw_read_lock+0x3d/0xa0 ? btrfs_tree_read_lock_atomic+0x34/0x140 btrfs_tree_read_lock_atomic+0x34/0x140 btrfs_search_slot+0x616/0x9a0 btrfs_lookup_dir_item+0x6c/0xb0 btrfs_lookup_dentry+0xa8/0x520 ? lockdep_init_map_waits+0x4c/0x210 btrfs_lookup+0xe/0x30 __lookup_slow+0x10f/0x1e0 walk_component+0x11b/0x190 path_lookupat+0x72/0x1c0 filename_lookup+0x97/0x180 ? strncpy_from_user+0x96/0x1e0 ? getname_flags.part.0+0x45/0x1a0 vfs_statx+0x64/0x100 ? lockdep_hardirqs_on_prepare+0xff/0x180 ? _raw_spin_unlock_irqrestore+0x41/0x50 __do_sys_newlstat+0x26/0x40 ? lockdep_hardirqs_on_prepare+0xff/0x180 ? syscall_enter_from_user_mode+0x27/0x80 ? syscall_enter_from_user_mode+0x27/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I have also seen a report of lockdep complaining about the lock class that was looked up being the same as the lock class on the lock we were using, but I can't find the report. These are problems that occur because we do not have the lockdep class set on the extent buffer until _after_ we read the eb in properly. This is problematic for concurrent readers, because we will create the extent buffer, lock it, and then attempt to read the extent buffer. If a second thread comes in and tries to do a search down the same path they'll get the above lockdep splat because the class isn't set properly on the extent buffer. There was a good reason for this, we generally didn't know the real owner of the eb until we read it, specifically in refcounted roots. However now all refcounted roots have the same class name, so we no longer need to worry about this. For non-refcounted trees we know which root we're on based on the parent. Fix this by setting the lockdep class on the eb at creation time instead of read time. This will fix the splat and the weirdness where the class changes in the middle of locking the block. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3fbaf258 |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass the owner_root and level to alloc_extent_buffer Now that we've plumbed all of the callers to have the owner root and the level, plumb it down into alloc_extent_buffer(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bfb484d9 |
|
05-Nov-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: cleanup extent buffer readahead We're going to pass around more information when we allocate extent buffers, in order to make that cleaner how we do readahead. Most of the callers have the parent node that we're getting our blockptr from, with the sole exception of relocation which simply has the bytenr it wants to read. Add a helper that takes the current arguments that we need (bytenr and gen), and add another helper for simply reading the slot out of a node. In followup patches the helper that takes all the extra arguments will be expanded, and the simpler helper won't need to have it's arguments adjusted. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
33fd2f71 |
|
28-Oct-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: create read policy framework As of now, we use the pid method to read striped mirrored data, which means process id determines the stripe id to read. This type of routing typically helps in a system with many small independent processes tying to read random data. On the other hand, the pid based read IO policy is inefficient because if there is a single process trying to read a large file, the overall disk bandwidth remains underutilized. So this patch introduces a read policy framework so that we could add more read policies, such as IO routing based on the device's wait-queue or manual when we have a read-preferred device or a policy based on the target storage caching. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42437a63 |
|
16-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce mount option rescue=ignorebadroots In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e7b5671 |
|
23-Nov-2020 |
Christoph Hellwig <hch@lst.de> |
block: remove i_bdev Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0697d9a6 |
|
18-Nov-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: don't access possibly stale fs_info data for printing duplicate device Syzbot reported a possible use-after-free when printing a duplicate device warning device_list_add(). At this point it can happen that a btrfs_device::fs_info is not correctly setup yet, so we're accessing stale data, when printing the warning message using the btrfs_printk() wrappers. ================================================================== BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068 CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943 btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359 btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x44840a RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630 RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003 Allocated by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] kvmalloc_node+0x81/0x110 mm/util.c:574 kvmalloc include/linux/mm.h:757 [inline] kvzalloc include/linux/mm.h:765 [inline] btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x3d/0x70 mm/kasan/common.c:56 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x113/0x200 mm/slab.c:3756 deactivate_locked_super+0xa7/0xf0 fs/super.c:335 btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff8880878e0000 which belongs to the cache kmalloc-16k of size 16384 The buggy address is located 1704 bytes inside of 16384-byte region [ffff8880878e0000, ffff8880878e4000) The buggy address belongs to the page: page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0 head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfffe0000010200(slab|head) raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00 raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The syzkaller reproducer for this use-after-free crafts a filesystem image and loop mounts it twice in a loop. The mount will fail as the crafted image has an invalid chunk tree. When this happens btrfs_mount_root() will call deactivate_locked_super(), which then cleans up fs_info and fs_info::sb. If a second thread now adds the same block-device to the filesystem, it will get detected as a duplicate device and device_list_add() will reject the duplicate and print a warning. But as the fs_info pointer passed in is non-NULL this will result in a use-after-free. Instead of printing possibly uninitialized or already freed memory in btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the device name will be skipped altogether. There was a slightly different approach discussed in https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf89af14 |
|
29-Oct-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: dev-replace: fail mount if we don't have replace item with target device If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace item, then it means the filesystem is inconsistent state. This is either corruption or a crafted image. Fail the mount as this needs a closer look what is actually wrong. As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace item, in __btrfs_free_extra_devids() we determine that there is an extra device, and free those extra devices but continue to mount the device. However, we were wrong in keeping tack of the rw_devices so the syzbot testcase failed: WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 RSP: 0018:ffffc900091777e0 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000 RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007 RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130 R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 The fix here is, when we determine that there isn't a replace item then fail the mount if there is a replace target device (devid 0). CC: stable@vger.kernel.org # 4.19+ Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d5c82388 |
|
21-Oct-2020 |
Davidlohr Bueso <dave@stgolabs.net> |
btrfs: convert data_seqcount to seqcount_mutex_t By doing so we can associate the sequence counter to the chunk_mutex for lockdep purposes (compiled-out otherwise), the mutex is otherwise used on the write side. Also avoid explicitly disabling preemption around the write region as it will now be done automatically by the seqcount machinery based on the lock type. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66d204a1 |
|
12-Oct-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix readahead hang and use-after-free after removing a device Very sporadically I had test case btrfs/069 from fstests hanging (for years, it is not a recent regression), with the following traces in dmesg/syslog: [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds. [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000 [162513.514751] Call Trace: [162513.514761] __schedule+0x5ce/0xd00 [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.514771] schedule+0x46/0xf0 [162513.514844] wait_current_trans+0xde/0x140 [btrfs] [162513.514850] ? finish_wait+0x90/0x90 [162513.514864] start_transaction+0x37c/0x5f0 [btrfs] [162513.514879] transaction_kthread+0xa4/0x170 [btrfs] [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs] [162513.514894] kthread+0x153/0x170 [162513.514897] ? kthread_stop+0x2c0/0x2c0 [162513.514902] ret_from_fork+0x22/0x30 [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds. [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000 [162513.515682] Call Trace: [162513.515688] __schedule+0x5ce/0xd00 [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.515697] schedule+0x46/0xf0 [162513.515712] wait_current_trans+0xde/0x140 [btrfs] [162513.515716] ? finish_wait+0x90/0x90 [162513.515729] start_transaction+0x37c/0x5f0 [btrfs] [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20 [162513.515761] iterate_supers+0x87/0xf0 [162513.515765] ksys_sync+0x60/0xb0 [162513.515768] __do_sys_sync+0xa/0x10 [162513.515771] do_syscall_64+0x33/0x80 [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.515781] RIP: 0033:0x7f5238f50bd7 [162513.515782] Code: Bad RIP value. [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0 [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds. [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000 [162513.516620] Call Trace: [162513.516625] __schedule+0x5ce/0xd00 [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.516634] schedule+0x46/0xf0 [162513.516647] wait_current_trans+0xde/0x140 [btrfs] [162513.516650] ? finish_wait+0x90/0x90 [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs] [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs] [162513.516686] __vfs_setxattr+0x66/0x80 [162513.516691] __vfs_setxattr_noperm+0x70/0x200 [162513.516697] vfs_setxattr+0x6b/0x120 [162513.516703] setxattr+0x125/0x240 [162513.516709] ? lock_acquire+0xb1/0x480 [162513.516712] ? mnt_want_write+0x20/0x50 [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0 [162513.516723] ? preempt_count_add+0x49/0xa0 [162513.516725] ? __sb_start_write+0x19b/0x290 [162513.516727] ? preempt_count_add+0x49/0xa0 [162513.516732] path_setxattr+0xba/0xd0 [162513.516739] __x64_sys_setxattr+0x27/0x30 [162513.516741] do_syscall_64+0x33/0x80 [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.516745] RIP: 0033:0x7f5238f56d5a [162513.516746] Code: Bad RIP value. [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470 [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700 [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004 [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0 [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds. [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000 [162513.517780] Call Trace: [162513.517786] __schedule+0x5ce/0xd00 [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.517796] schedule+0x46/0xf0 [162513.517810] wait_current_trans+0xde/0x140 [btrfs] [162513.517814] ? finish_wait+0x90/0x90 [162513.517829] start_transaction+0x37c/0x5f0 [btrfs] [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20 [162513.517865] iterate_supers+0x87/0xf0 [162513.517869] ksys_sync+0x60/0xb0 [162513.517872] __do_sys_sync+0xa/0x10 [162513.517875] do_syscall_64+0x33/0x80 [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.517881] RIP: 0033:0x7f5238f50bd7 [162513.517883] Code: Bad RIP value. [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053 [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0 [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053 [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds. [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000 [162513.519160] Call Trace: [162513.519165] __schedule+0x5ce/0xd00 [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.519174] schedule+0x46/0xf0 [162513.519190] wait_current_trans+0xde/0x140 [btrfs] [162513.519193] ? finish_wait+0x90/0x90 [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs] [162513.519222] btrfs_create+0x57/0x200 [btrfs] [162513.519230] lookup_open+0x522/0x650 [162513.519246] path_openat+0x2b8/0xa50 [162513.519270] do_filp_open+0x91/0x100 [162513.519275] ? find_held_lock+0x32/0x90 [162513.519280] ? lock_acquired+0x33b/0x470 [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0 [162513.519287] ? _raw_spin_unlock+0x29/0x40 [162513.519295] do_sys_openat2+0x20d/0x2d0 [162513.519300] do_sys_open+0x44/0x80 [162513.519304] do_syscall_64+0x33/0x80 [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.519309] RIP: 0033:0x7f5238f4a903 [162513.519310] Code: Bad RIP value. [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903 [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470 [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002 [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013 [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620 [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds. [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002 [162513.520511] Call Trace: [162513.520516] __schedule+0x5ce/0xd00 [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.520525] schedule+0x46/0xf0 [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs] [162513.520548] ? finish_wait+0x90/0x90 [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs] [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs] [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs] [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs] [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs] [162513.520643] ? do_sigaction+0xf3/0x240 [162513.520645] ? find_held_lock+0x32/0x90 [162513.520648] ? do_sigaction+0xf3/0x240 [162513.520651] ? lock_acquired+0x33b/0x470 [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50 [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100 [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50 [162513.520662] ? do_sigaction+0xf3/0x240 [162513.520671] ? __x64_sys_ioctl+0x83/0xb0 [162513.520672] __x64_sys_ioctl+0x83/0xb0 [162513.520677] do_syscall_64+0x33/0x80 [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.520681] RIP: 0033:0x7fc3cd307d87 [162513.520682] Code: Bad RIP value. [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87 [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003 [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003 [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001 [162513.520703] Showing all locks held in the system: [162513.520712] 1 lock held by khungtaskd/54: [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197 [162513.520728] 1 lock held by in:imklog/596: [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60 [162513.520782] 1 lock held by btrfs-transacti/1356167: [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs] [162513.520798] 1 lock held by btrfs/1356190: [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60 [162513.520805] 1 lock held by fsstress/1356184: [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520811] 3 locks held by fsstress/1356185: [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120 [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520833] 1 lock held by fsstress/1356196: [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520838] 3 locks held by fsstress/1356197: [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50 [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520858] 2 locks held by btrfs/1356211: [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs] [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] This was weird because the stack traces show that a transaction commit, triggered by a device replace operation, is blocking trying to pause any running scrubs but there are no stack traces of blocked tasks doing a scrub. After poking around with drgn, I noticed there was a scrub task that was constantly running and blocking for shorts periods of time: >>> t = find_task(prog, 1356190) >>> prog.stack_trace(t) #0 __schedule+0x5ce/0xcfc #1 schedule+0x46/0xe4 #2 schedule_timeout+0x1df/0x475 #3 btrfs_reada_wait+0xda/0x132 #4 scrub_stripe+0x2a8/0x112f #5 scrub_chunk+0xcd/0x134 #6 scrub_enumerate_chunks+0x29e/0x5ee #7 btrfs_scrub_dev+0x2d5/0x91b #8 btrfs_ioctl+0x7f5/0x36e7 #9 __x64_sys_ioctl+0x83/0xb0 #10 do_syscall_64+0x33/0x77 #11 entry_SYSCALL_64+0x7c/0x156 Which corresponds to: int btrfs_reada_wait(void *handle) { struct reada_control *rc = handle; struct btrfs_fs_info *fs_info = rc->fs_info; while (atomic_read(&rc->elems)) { if (!atomic_read(&fs_info->reada_works_cnt)) reada_start_machine(fs_info); wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0, (HZ + 9) / 10); } (...) So the counter "rc->elems" was set to 1 and never decreased to 0, causing the scrub task to loop forever in that function. Then I used the following script for drgn to check the readahead requests: $ cat dump_reada.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) def dump_re(re): nzones = re.nzones.value_() print(f're at {hex(re.value_())}') print(f'\t logical {re.logical.value_()}') print(f'\t refcnt {re.refcnt.value_()}') print(f'\t nzones {nzones}') for i in range(nzones): dev = re.zones[i].device name = dev.name.str.string_() print(f'\t\t dev id {dev.devid.value_()} name {name}') print() for _, e in radix_tree_for_each(fs_info.reada_tree): re = cast('struct reada_extent *', e) dump_re(re) $ drgn dump_reada.py re at 0xffff8f3da9d25ad8 logical 38928384 refcnt 1 nzones 1 dev id 0 name b'/dev/sdd' $ So there was one readahead extent with a single zone corresponding to the source device of that last device replace operation logged in dmesg/syslog. Also the ID of that zone's device was 0 which is a special value set in the source device of a device replace operation when the operation finishes (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()), confirming again that device /dev/sdd was the source of a device replace operation. Normally there should be as many zones in the readahead extent as there are devices, and I wasn't expecting the extent to be in a block group with a 'single' profile, so I went and confirmed with the following drgn script that there weren't any single profile block groups: $ cat dump_block_groups.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) BTRFS_BLOCK_GROUP_DATA = (1 << 0) BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1) BTRFS_BLOCK_GROUP_METADATA = (1 << 2) BTRFS_BLOCK_GROUP_RAID0 = (1 << 3) BTRFS_BLOCK_GROUP_RAID1 = (1 << 4) BTRFS_BLOCK_GROUP_DUP = (1 << 5) BTRFS_BLOCK_GROUP_RAID10 = (1 << 6) BTRFS_BLOCK_GROUP_RAID5 = (1 << 7) BTRFS_BLOCK_GROUP_RAID6 = (1 << 8) BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9) BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10) def bg_flags_string(bg): flags = bg.flags.value_() ret = '' if flags & BTRFS_BLOCK_GROUP_DATA: ret = 'data' if flags & BTRFS_BLOCK_GROUP_METADATA: if len(ret) > 0: ret += '|' ret += 'meta' if flags & BTRFS_BLOCK_GROUP_SYSTEM: if len(ret) > 0: ret += '|' ret += 'system' if flags & BTRFS_BLOCK_GROUP_RAID0: ret += ' raid0' elif flags & BTRFS_BLOCK_GROUP_RAID1: ret += ' raid1' elif flags & BTRFS_BLOCK_GROUP_DUP: ret += ' dup' elif flags & BTRFS_BLOCK_GROUP_RAID10: ret += ' raid10' elif flags & BTRFS_BLOCK_GROUP_RAID5: ret += ' raid5' elif flags & BTRFS_BLOCK_GROUP_RAID6: ret += ' raid6' elif flags & BTRFS_BLOCK_GROUP_RAID1C3: ret += ' raid1c3' elif flags & BTRFS_BLOCK_GROUP_RAID1C4: ret += ' raid1c4' else: ret += ' single' return ret def dump_bg(bg): print() print(f'block group at {hex(bg.value_())}') print(f'\t start {bg.start.value_()} length {bg.length.value_()}') print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}') bg_root = fs_info.block_group_cache_tree.address_of_() for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'): dump_bg(bg) $ drgn dump_block_groups.py block group at 0xffff8f3d673b0400 start 22020096 length 16777216 flags 258 - system raid6 block group at 0xffff8f3d53ddb400 start 38797312 length 536870912 flags 260 - meta raid6 block group at 0xffff8f3d5f4d9c00 start 575668224 length 2147483648 flags 257 - data raid6 block group at 0xffff8f3d08189000 start 2723151872 length 67108864 flags 258 - system raid6 block group at 0xffff8f3db70ff000 start 2790260736 length 1073741824 flags 260 - meta raid6 block group at 0xffff8f3d5f4dd800 start 3864002560 length 67108864 flags 258 - system raid6 block group at 0xffff8f3d67037000 start 3931111424 length 2147483648 flags 257 - data raid6 $ So there were only 2 reasons left for having a readahead extent with a single zone: reada_find_zone(), called when creating a readahead extent, returned NULL either because we failed to find the corresponding block group or because a memory allocation failed. With some additional and custom tracing I figured out that on every further ocurrence of the problem the block group had just been deleted when we were looping to create the zones for the readahead extent (at reada_find_extent()), so we ended up with only one zone in the readahead extent, corresponding to a device that ends up getting replaced. So after figuring that out it became obvious why the hang happens: 1) Task A starts a scrub on any device of the filesystem, except for device /dev/sdd; 2) Task B starts a device replace with /dev/sdd as the source device; 3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently starting to scrub a stripe from block group X. This call to btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add() calls reada_add_block(), it passes the logical address of the extent tree's root node as its 'logical' argument - a value of 38928384; 4) Task A then enters reada_find_extent(), called from reada_add_block(). It finds there isn't any existing readahead extent for the logical address 38928384, so it proceeds to the path of creating a new one. It calls btrfs_map_block() to find out which stripes exist for the block group X. On the first iteration of the for loop that iterates over the stripes, it finds the stripe for device /dev/sdd, so it creates one zone for that device and adds it to the readahead extent. Before getting into the second iteration of the loop, the cleanup kthread deletes block group X because it was empty. So in the iterations for the remaining stripes it does not add more zones to the readahead extent, because the calls to reada_find_zone() returned NULL because they couldn't find block group X anymore. As a result the new readahead extent has a single zone, corresponding to the device /dev/sdd; 4) Before task A returns to btrfs_reada_add() and queues the readahead job for the readahead work queue, task B finishes the device replace and at btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new device /dev/sdg; 5) Task A returns to reada_add_block(), which increments the counter "->elems" of the reada_control structure allocated at btrfs_reada_add(). Then it returns back to btrfs_reada_add() and calls reada_start_machine(). This queues a job in the readahead work queue to run the function reada_start_machine_worker(), which calls __reada_start_machine(). At __reada_start_machine() we take the device list mutex and for each device found in the current device list, we call reada_start_machine_dev() to start the readahead work. However at this point the device /dev/sdd was already freed and is not in the device list anymore. This means the corresponding readahead for the extent at 38928384 is never started, and therefore the "->elems" counter of the reada_control structure allocated at btrfs_reada_add() never goes down to 0, causing the call to btrfs_reada_wait(), done by the scrub task, to wait forever. Note that the readahead request can be made either after the device replace started or before it started, however in pratice it is very unlikely that a device replace is able to start after a readahead request is made and is able to complete before the readahead request completes - maybe only on a very small and nearly empty filesystem. This hang however is not the only problem we can have with readahead and device removals. When the readahead extent has other zones other than the one corresponding to the device that is being removed (either by a device replace or a device remove operation), we risk having a use-after-free on the device when dropping the last reference of the readahead extent. For example if we create a readahead extent with two zones, one for the device /dev/sdd and one for the device /dev/sde: 1) Before the readahead worker starts, the device /dev/sdd is removed, and the corresponding btrfs_device structure is freed. However the readahead extent still has the zone pointing to the device structure; 2) When the readahead worker starts, it only finds device /dev/sde in the current device list of the filesystem; 3) It starts the readahead work, at reada_start_machine_dev(), using the device /dev/sde; 4) Then when it finishes reading the extent from device /dev/sde, it calls __readahead_hook() which ends up dropping the last reference on the readahead extent through the last call to reada_extent_put(); 5) At reada_extent_put() it iterates over each zone of the readahead extent and attempts to delete an element from the device's 'reada_extents' radix tree, resulting in a use-after-free, as the device pointer of the zone for /dev/sdd is now stale. We can also access the device after dropping the last reference of a zone, through reada_zone_release(), also called by reada_extent_put(). And a device remove suffers the same problem, however since it shrinks the device size down to zero before removing the device, it is very unlikely to still have readahead requests not completed by the time we free the device, the only possibility is if the device has a very little space allocated. While the hang problem is exclusive to scrub, since it is currently the only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free problem affects any path that triggers readhead, which includes btree_readahead_hook() and __readahead_hook() (a readahead worker can trigger readahed for the children of a node) for example - any path that ends up calling reada_add_block() can trigger the use-after-free after a device is removed. So fix this by waiting for any readahead requests for a device to complete before removing a device, ensuring that while waiting for existing ones no new ones can be made. This problem has been around for a very long time - the readahead code was added in 2011, device remove exists since 2008 and device replace was introduced in 2013, hard to pick a specific commit for a git Fixes tag. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96c2e067 |
|
30-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: skip devices without magic signature when mounting Many things can happen after the device is scanned and before the device is mounted. One such thing is losing the BTRFS_MAGIC on the device. If it happens we still won't free that device from the memory and cause the userland confusion. For example: As the BTRFS_IOC_DEV_INFO still carries the device path which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists device which does not belong to the filesystem anymore: $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb $ wipefs -a /dev/sdb # /dev/sdb does not contain magic signature $ mount -o degraded /dev/sda /btrfs $ btrfs fi show -m Label: none uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd Total devices 2 FS bytes used 128.00KiB devid 1 size 3.00GiB used 571.19MiB path /dev/sda devid 2 size 3.00GiB used 571.19MiB path /dev/sdb We need to distinguish the missing signature and invalid superblock, so add a specific error code ENODATA for that. This also fixes failure of fstest btrfs/198. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92e26df4 |
|
18-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: return error if we're unable to read device stats I noticed when fixing device stats for seed devices that we simply threw away the return value from btrfs_search_slot(). This is because we may not have stat items, but we could very well get an error, and thus miss reporting the error up the chain. Fix this by returning ret if it's an actual error, and then stop trying to init the rest of the devices stats and return the error up the chain. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
124604eb |
|
18-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: init device stats for seed devices We recently started recording device stats across the fleet, and noticed a large increase in messages such as this BTRFS warning (device dm-0): get dev_stats failed, not yet valid on our tiers that use seed devices for their root devices. This is because we do not initialize the device stats for any seed devices if we have a sprout device and mount using that sprout device. The basic steps for reproducing are: $ mkfs seed device $ mount seed device # fill seed device $ umount seed device $ btrfstune -S 1 seed device $ mount seed device $ btrfs device add -f sprout device /mnt/wherever $ umount /mnt/wherever $ mount sprout device /mnt/wherever $ btrfs device stats /mnt/wherever This will fail with the above message in dmesg. Fix this by iterating over the fs_devices->seed if they exist in btrfs_init_dev_stats. This fixed the problem and properly reports the stats for both devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_device_init_dev_stats ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c83b60c0 |
|
04-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify gotos in open_seed_device The function does not have a common exit block and returns immediatelly so there's no point having the goto. Remove the two cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e493e8f9 |
|
04-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() We can check the argument value directly, no need for the temporary variable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e17125b5 |
|
04-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use sprout device_list_mutex in btrfs_init_devices_late On a mounted sprout filesystem, all threads now are using the sprout::device_list_mutex, and this is the only code using the seed::device_list_mutex. This patch converts to use the sprouts fs_info->fs_devices->device_list_mutex. The same reasoning holds true here, that device delete is holding the sprout::device_list_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
53f8a74c |
|
04-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: split and refactor btrfs_sysfs_remove_devices_dir Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split btrfs_sysfs_remove_devices_dir() so that we don't have to use the device argument to indicate whether to free all devices or just one device. Export btrfs_sysfs_remove_device() as device operations outside of sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir(). btrfs_sysfs_remove_devices_dir() is renamed to btrfs_sysfs_remove_fs_devices() to suite its new role. Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices() so it is redeclared s static. And the same function had to be moved before its first caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cd36da2e |
|
04-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: simplify parameters of btrfs_sysfs_add_devices_dir When we add a device we need to add it to sysfs, so instead of using the btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to add a device or all of fs_devices, call the helper function directly btrfs_sysfs_add_device() and thus make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
79dae17d |
|
03-Sep-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: improve device scanning messages Systems booting without the initramfs seems to scan an unusual kind of device path (/dev/root). And at a later time, the device is updated to the correct path. We generally print the process name and PID of the process scanning the device but we don't capture the same information if the device path is rescanned with a different pathname. The current message is too long, so drop the unnecessary UUID and add process name and PID. While at this also update the duplicate device warning to include the process name and PID so the messages are consistent CC: stable@vger.kernel.org # 4.19+ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3e1f96c |
|
25-Aug-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: enumerate the type of exclusive operation in progress Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ca10845a |
|
01-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: sysfs: init devices outside of the chunk_mutex While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc3+ #4 Not tainted ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc+0x37/0x270 alloc_inode+0x82/0xb0 iget_locked+0x10d/0x2c0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x240 sysfs_get_tree+0x16/0x40 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (kernfs_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 kernfs_add_one+0x23/0x150 kernfs_create_link+0x63/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x81/0x130 btrfs_init_new_device+0x67f/0x1250 btrfs_ioctl+0x1ef/0x2e20 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_insert_empty_items+0x64/0xb0 btrfs_insert_delayed_items+0x90/0x4f0 btrfs_commit_inode_delayed_items+0x93/0x140 btrfs_log_inode+0x5de/0x2020 btrfs_log_inode_parent+0x429/0xc90 btrfs_log_new_name+0x95/0x9b btrfs_rename2+0xbb9/0x1800 vfs_rename+0x64f/0x9f0 do_renameat2+0x320/0x4e0 __x64_sys_rename+0x1f/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb8 check_noncircular+0x12d/0x150 __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa7/0x3d0 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x41/0x50 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we are holding the chunk_mutex at the time of adding in a new device. However we only need to hold the device_list_mutex, as we're going to iterate over the fs_devices devices. Move the sysfs init stuff outside of the chunk_mutex to get rid of this lockdep splat. CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions CC: stable@vger.kernel.org # 4.4.x Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9ba017f |
|
22-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: don't opencode sync_blockdev in btrfs_init_new_device Instead of opencoding filemap_write_and_wait simply call syncblockdev as it makes it abundantly clear what's going on and why this is used. No semantics changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ae312e9 |
|
22-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove redundant code from btrfs_free_stale_devices Following the refactor of btrfs_free_stale_devices in 7bcb8164ad94 ("btrfs: use device_list_mutex when removing stale devices") fs_devices are freed after they have been iterated by the inner list_for_each so the use-after-free fixed by introducing the break in fd649f10c3d2 ("btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device") is no longer necessary. Just remove it altogether. No functional changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44cab9ba |
|
22-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: refactor locked condition in btrfs_init_new_device Invert unlocked to locked and exploit the fact it can only ever be modified if we are adding a new device to a seed filesystem. This allows to simplify the check in error: label. No semantics changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f4cfa9bd |
|
22-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: use RCU for quick device check in btrfs_init_new_device When adding a new device there's a mandatory check to see if a device is being duplicated to the filesystem it's added to. Since this is a read-only operations not necessary to take device_list_mutex and can simply make do with an rcu-readlock. Using just RCU is safe because there won't be another device add delete running in parallel as btrfs_init_new_device is called only from btrfs_ioctl_add_dev. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
425c6ed6 |
|
10-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not hold device_list_mutex when closing devices The following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted ------------------------------------------------------ fsstress/8739 is trying to acquire lock: ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70 but task is already holding lock: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #10 (sb_pagefaults){.+.+}-{0:0}: __sb_start_write+0x129/0x210 btrfs_page_mkwrite+0x6a/0x4a0 do_page_mkwrite+0x4d/0xc0 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 asm_exc_page_fault+0x1e/0x30 -> #9 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 perf_read+0x141/0x2c0 vfs_read+0xad/0x1b0 ksys_read+0x5f/0xe0 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #8 (&cpuctx_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x88/0x150 perf_event_init+0x1db/0x20b start_kernel+0x3ae/0x53c secondary_startup_64+0xa4/0xb0 -> #7 (pmus_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x4f/0x150 cpuhp_invoke_callback+0xb1/0x900 _cpu_up.constprop.26+0x9f/0x130 cpu_up+0x7b/0xc0 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x71 kernel_init_freeable+0x110/0x258 kernel_init+0xa/0x103 ret_from_fork+0x1f/0x30 -> #6 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x39/0xb0 kmem_cache_create_usercopy+0x28/0x230 kmem_cache_create+0x12/0x20 bioset_init+0x15e/0x2b0 init_bio+0xa3/0xaa do_one_initcall+0x5a/0x2e0 kernel_init_freeable+0x1f4/0x258 kernel_init+0xa/0x103 ret_from_fork+0x1f/0x30 -> #5 (bio_slab_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 bioset_init+0xbc/0x2b0 __blk_alloc_queue+0x6f/0x2d0 blk_mq_init_queue_data+0x1b/0x70 loop_add+0x110/0x290 [loop] fq_codel_tcf_block+0x12/0x20 [sch_fq_codel] do_one_initcall+0x5a/0x2e0 do_init_module+0x5a/0x220 load_module+0x2459/0x26e0 __do_sys_finit_module+0xba/0xe0 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (loop_ctl_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 lo_open+0x18/0x50 [loop] __blkdev_get+0xec/0x570 blkdev_get+0xe8/0x150 do_dentry_open+0x167/0x410 path_openat+0x7c9/0xa80 do_filp_open+0x93/0x100 do_sys_openat2+0x22a/0x2e0 do_sys_open+0x4b/0x80 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 blkdev_put+0x1d/0x120 close_fs_devices.part.31+0x84/0x130 btrfs_close_devices+0x44/0xb0 close_ctree+0x296/0x2b2 generic_shutdown_super+0x69/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_run_dev_stats+0x49/0x480 commit_cowonly_roots+0xb5/0x2a0 btrfs_commit_transaction+0x516/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_commit_transaction+0x4bb/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __mutex_lock+0x9f/0x930 btrfs_record_root_in_trans+0x43/0x70 start_transaction+0xd1/0x5d0 btrfs_dirty_inode+0x42/0xd0 file_update_time+0xc8/0x110 btrfs_page_mkwrite+0x10c/0x4a0 do_page_mkwrite+0x4d/0xc0 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_pagefaults); lock(&mm->mmap_lock#2); lock(sb_pagefaults); lock(&fs_info->reloc_mutex); *** DEADLOCK *** 3 locks held by fsstress/8739: #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0 stack backtrace: CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 ? btrfs_get_alloc_profile+0x150/0x210 lock_acquire+0x9e/0x360 ? btrfs_record_root_in_trans+0x43/0x70 __mutex_lock+0x9f/0x930 ? btrfs_record_root_in_trans+0x43/0x70 ? lock_acquire+0x9e/0x360 ? join_transaction+0x5d/0x450 ? find_held_lock+0x2d/0x90 ? btrfs_record_root_in_trans+0x43/0x70 ? join_transaction+0x3d5/0x450 ? btrfs_record_root_in_trans+0x43/0x70 btrfs_record_root_in_trans+0x43/0x70 start_transaction+0xd1/0x5d0 btrfs_dirty_inode+0x42/0xd0 file_update_time+0xc8/0x110 btrfs_page_mkwrite+0x10c/0x4a0 ? handle_mm_fault+0x5e/0x1730 do_page_mkwrite+0x4d/0xc0 ? __do_fault+0x32/0x150 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x7faa6c9969c4 Was seen in testing. The fix is similar to that of btrfs: open device without device_list_mutex where we're holding the device_list_mutex and then grab the bd_mutex, which pulls in a bunch of dependencies under the bd_mutex. We only ever call btrfs_close_devices() on mount failure or unmount, so we're save to not have the device_list_mutex here. We're already holding the uuid_mutex which keeps us safe from any external modification of the fs_devices. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62cf5391 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks When closing and freeing the source device we could end up doing our final blkdev_put() on the bdev, which will grab the bd_mutex. As such we want to be holding as few locks as possible, so move this call outside of the dev_replace->lock_finishing_cancel_unmount lock. Since we're modifying the fs_devices we need to make sure we're holding the uuid_mutex here, so take that as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
68abf360 |
|
12-Aug-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove alloc_list splice in btrfs_prepare_sprout btrfs_prepare_sprout is called when the first rw device is added to a seed filesystem. This means the filesystem can't have its alloc_list be non-empty, since seed filesystems are read only. Simply remove the code altogether. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
427c8fdd |
|
12-Aug-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: document some invariants of seed code Without good understanding of how seed devices works it's hard to grok some of what the code in open_seed_devices or btrfs_prepare_sprout does. Add comments hopefully reducing some of the cognitive load. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
944d3f9f |
|
16-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: switch seed device to list api While this patch touches a bunch of files the conversion is straighforward. Instead of using the implicit linked list anchored at btrfs_fs_devices::seed the code is switched to using list_for_each_entry. Previous patches in the series already factored out code that processed both main and seed devices so in those cases the factored out functions are called on the main fs_devices and then on every seed dev inside list_for_each_entry. Using list api also allows to simplify deletion from the seed dev list performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by substituting a while() loop with a simple list_del_init. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4989c2f |
|
15-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: simplify setting/clearing fs_info to btrfs_fs_devices It makes no sense to have sysfs-related routines be responsible for properly initialising the fs_info pointer of struct btrfs_fs_device. Instead this can be streamlined by making it the responsibility of btrfs_init_devices_late to initialize it. That function already initializes fs_info of every individual device in btrfs_fs_devices. As far as clearing it is concerned it makes sense to move it to close_fs_devices. That function is only called when struct btrfs_fs_devices is no longer in use - either for holding seeds or main devices for a mounted filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
54eed6ae |
|
15-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make close_fs_devices return void The return value of this function conveys absolutely no information. All callers already check the state of fs_devices->opened to decide how to proceed. So convert the function to returning void. While at it make btrfs_close_devices also return void. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3712ccb7 |
|
16-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: factor out loop logic from btrfs_free_extra_devids This prepares the code to switching seeds devices to a proper list. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
154f7cb8 |
|
20-Aug-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: add owner and fs_info to alloc_state io_tree Commit 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree") introduced btrfs_device::alloc_state extent io tree, but it doesn't initialize the fs_info and owner member. This means the following features are not properly supported: - Fs owner report for insert_state() error Without fs_info initialized, although btrfs_err() won't panic, it won't output which fs is causing the error. - Wrong owner for trace events alloc_state will get the owner as pinned extents. Fix this by assiging proper fs_info and owner for btrfs_device::alloc_state. Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e560081 |
|
12-Aug-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid It can be accessed from 'fs_devices' as it's identical to fs_info->fs_devices. Also add a comment about why we are calling the function. No semantic changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a466c85e |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks When closing and freeing the source device we could end up doing our final blkdev_put() on the bdev, which will grab the bd_mutex. As such we want to be holding as few locks as possible, so move this call outside of the dev_replace->lock_finishing_cancel_unmount lock. Since we're modifying the fs_devices we need to make sure we're holding the uuid_mutex here, so take that as well. There's a report from syzbot probably hitting one of the cases where the bd_mutex and device_list_mutex are taken in the wrong order, however it's not with device replace, like this patch fixes. As there's no reproducer available so far, we can't verify the fix. https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/ dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419 WARNING: possible circular locking dependency detected 5.9.0-rc5-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.0/6878 is trying to acquire lock: ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804 but task is already holding lock: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255 btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109 __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916 find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline] find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370 do_writepages+0xec/0x290 mm/page-writeback.c:2352 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894 wb_do_writeback fs/fs-writeback.c:2039 [inline] wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 -> #3 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x234/0x470 fs/super.c:1672 sb_start_intwrite include/linux/fs.h:1690 [inline] start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624 find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline] find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370 do_writepages+0xec/0x290 mm/page-writeback.c:2352 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894 wb_do_writeback fs/fs-writeback.c:2039 [inline] wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __flush_work+0x60e/0xac0 kernel/workqueue.c:3041 wb_shutdown+0x180/0x220 mm/backing-dev.c:355 bdi_unregister+0x174/0x590 mm/backing-dev.c:872 del_gendisk+0x820/0xa10 block/genhd.c:933 loop_remove drivers/block/loop.c:2192 [inline] loop_control_ioctl drivers/block/loop.c:2291 [inline] loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (loop_ctl_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 lo_open+0x19/0xd0 drivers/block/loop.c:1893 __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507 blkdev_get fs/block_dev.c:1639 [inline] blkdev_open+0x227/0x300 fs/block_dev.c:1753 do_dentry_open+0x4b9/0x11b0 fs/open.c:817 do_open fs/namei.c:3251 [inline] path_openat+0x1b9a/0x2730 fs/namei.c:3368 do_filp_open+0x17e/0x3c0 fs/namei.c:3395 do_sys_openat2+0x16d/0x420 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_sys_open fs/open.c:1192 [inline] __se_sys_open fs/open.c:1188 [inline] __x64_sys_open+0x119/0x1c0 fs/open.c:1188 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&bdev->bd_mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 blkdev_put+0x30/0x520 fs/block_dev.c:1804 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline] btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline] btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline] close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149 generic_shutdown_super+0x144/0x370 fs/super.c:464 kill_anon_super+0x36/0x60 fs/super.c:1108 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265 deactivate_locked_super+0x94/0x160 fs/super.c:335 deactivate_super+0xad/0xd0 fs/super.c:366 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:163 [inline] exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_devs->device_list_mutex); lock(sb_internal#2); lock(&fs_devs->device_list_mutex); lock(&bdev->bd_mutex); *** DEADLOCK *** 3 locks held by syz-executor.0/6878: #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365 #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178 #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159 stack backtrace: CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827 check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 blkdev_put+0x30/0x520 fs/block_dev.c:1804 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline] btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline] btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline] close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149 generic_shutdown_super+0x144/0x370 fs/super.c:464 kill_anon_super+0x36/0x60 fs/super.c:1108 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265 deactivate_locked_super+0x94/0x160 fs/super.c:335 deactivate_super+0xad/0xd0 fs/super.c:366 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:163 [inline] exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x460027 RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027 RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0 RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460 R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460 Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add syzbot reference ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
313b0858 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing We need to move the closing of the src_device out of all the device replace locking, but we definitely want to zero out the superblock before we commit the last time to make sure the device is properly removed. Handle this by pushing btrfs_scratch_superblocks into btrfs_dev_replace_finishing, and then later on we'll move the src_device closing and freeing stuff where we need it to be. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fccc0007 |
|
31-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix lockdep splat in add_missing_dev Nikolay reported a lockdep splat in generic/476 that I could reproduce with btrfs/187. ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc2+ #1 Tainted: G W ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc_trace+0x3a/0x1a0 btrfs_alloc_device+0x43/0x210 add_missing_dev+0x20/0x90 read_one_chunk+0x301/0x430 btrfs_read_sys_array+0x17b/0x1b0 open_ctree+0xa62/0x1896 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x379 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_lookup_inode+0x2a/0x8f __btrfs_update_delayed_inode+0x80/0x240 btrfs_commit_inode_delayed_inode+0x119/0x120 btrfs_evict_inode+0x357/0x500 evict+0xcf/0x1f0 vfs_rmdir.part.0+0x149/0x160 do_rmdir+0x136/0x1a0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x12d/0x150 __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa4/0x3d0 ? btrfs_evict_inode+0x11e/0x500 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x46/0x60 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This is because we are holding the chunk_mutex when we call btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want to switch that to a GFP_NOFS lock because this is the only place where it matters. So instead use memalloc_nofs_save() around the allocation in order to avoid the lockdep splat. Reported-by: Nikolay Borisov <nborisov@suse.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9771a5cf |
|
10-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop path before adding new uuid tree entry With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted ------------------------------------------------------ btrfs-uuid/7955 is trying to acquire lock: ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 but task is already holding lock: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-uuid-00){++++}-{3:3}: down_read_nested+0x3e/0x140 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_uuid_tree_add+0x89/0x2d0 btrfs_uuid_scan_kthread+0x330/0x390 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #0 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 down_read_nested+0x3e/0x140 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_find_root+0x45/0x1b0 btrfs_read_tree_root+0x61/0x100 btrfs_get_root_ref.part.50+0x143/0x630 btrfs_uuid_tree_iterate+0x207/0x314 btrfs_uuid_rescan_kthread+0x12/0x50 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-uuid-00); lock(btrfs-root-00); lock(btrfs-uuid-00); lock(btrfs-root-00); *** DEADLOCK *** 1 lock held by btrfs-uuid/7955: #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __btrfs_tree_read_lock+0x39/0x180 ? btrfs_root_node+0x1c/0x1d0 down_read_nested+0x3e/0x140 ? __btrfs_tree_read_lock+0x39/0x180 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_find_root+0x45/0x1b0 btrfs_read_tree_root+0x61/0x100 btrfs_get_root_ref.part.50+0x143/0x630 btrfs_uuid_tree_iterate+0x207/0x314 ? btree_readpage+0x20/0x20 btrfs_uuid_rescan_kthread+0x12/0x50 kthread+0x133/0x150 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 This problem exists because we have two different rescan threads, btrfs_uuid_scan_kthread which creates the uuid tree, and btrfs_uuid_tree_iterate that goes through and updates or deletes any out of date roots. The problem is they both do things in different order. btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but then does a btrfs_get_fs_root() which can read from the tree_root. It's actually easy enough to not be holding the path in btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop it further down and re-start the search when we loop. So simply move the path release before we add our entry to the uuid tree. This also fixes a problem where we're holding a path open after we do btrfs_end_transaction(), which has it's own problems. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c57dd1f2 |
|
31-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary [BUG] The following script can lead to tons of beyond device boundary access: mkfs.btrfs -f $dev -b 10G mount $dev $mnt trimfs $mnt btrfs filesystem resize 1:-1G $mnt trimfs $mnt [CAUSE] Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit"), we try to avoid trimming ranges that's already trimmed. So we check device->alloc_state by finding the first range which doesn't have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. But if we shrunk the device, that bits are not cleared, thus we could easily got a range starts beyond the shrunk device size. This results the returned @start and @end are all beyond device size, then we call "end = min(end, device->total_bytes -1);" making @end smaller than device size. Then finally we goes "len = end - start + 1", totally underflow the result, and lead to the beyond-device-boundary access. [FIX] This patch will fix the problem in two ways: - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device This is the root fix - Add extra safety check when trimming free device extents We check and warn if the returned range is already beyond current device. Link: https://github.com/kdave/btrfs-progs/issues/282 Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
01d01caf |
|
17-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the chunk_mutex in btrfs_read_chunk_tree We are currently getting this lockdep splat in btrfs/161: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc5+ #20 Tainted: G E ------------------------------------------------------ mount/678048 is trying to acquire lock: ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs] but task is already holding lock: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x8b/0x8f0 btrfs_init_new_device+0x2d2/0x1240 [btrfs] btrfs_ioctl+0x1de/0x2d20 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __lock_acquire+0x1240/0x2460 lock_acquire+0xab/0x360 __mutex_lock+0x8b/0x8f0 clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] btrfs_mount_root.cold+0x13/0xfa [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); *** DEADLOCK *** 3 locks held by mount/678048: #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs] #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] stack backtrace: CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 Call Trace: dump_stack+0x96/0xd0 check_noncircular+0x162/0x180 __lock_acquire+0x1240/0x2460 ? asm_sysvec_apic_timer_interrupt+0x12/0x20 lock_acquire+0xab/0x360 ? clone_fs_devices+0x4d/0x170 [btrfs] __mutex_lock+0x8b/0x8f0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? rcu_read_lock_sched_held+0x52/0x60 ? cpumask_next+0x16/0x20 ? module_assert_mutex_or_preempt+0x14/0x40 ? __module_address+0x28/0xf0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? static_obj+0x4f/0x60 ? lockdep_init_map_waits+0x43/0x200 ? clone_fs_devices+0x4d/0x170 [btrfs] clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] ? super_setup_bdi_name+0x79/0xd0 btrfs_mount_root.cold+0x13/0xfa [btrfs] ? vfs_parse_fs_string+0x84/0xb0 ? rcu_read_lock_sched_held+0x52/0x60 ? kfree+0x2b5/0x310 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] ? cred_has_capability+0x7c/0x120 ? rcu_read_lock_sched_held+0x52/0x60 ? legacy_get_tree+0x30/0x50 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 ? memdup_user+0x4e/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and then read the device, which takes the device_list_mutex. The device_list_mutex needs to be taken before the chunk_mutex, so this is a problem. We only really need the chunk mutex around adding the chunk, so move the mutex around read_one_chunk. An argument could be made that we don't even need the chunk_mutex here as it's during mount, and we are protected by various other locks. However we already have special rules for ->device_list_mutex, and I'd rather not have another special case for ->chunk_mutex. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
18c850fd |
|
17-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: open device without device_list_mutex There's long existed a lockdep splat because we open our bdev's under the ->device_list_mutex at mount time, which acquires the bd_mutex. Usually this goes unnoticed, but if you do loopback devices at all suddenly the bd_mutex comes with a whole host of other dependencies, which results in the splat when you mount a btrfs file system. ====================================================== WARNING: possible circular locking dependency detected 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted ------------------------------------------------------ systemd-journal/509 is trying to acquire lock: ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs] but task is already holding lock: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (sb_pagefaults){.+.+}-{0:0}: __sb_start_write+0x13e/0x220 btrfs_page_mkwrite+0x59/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 -> #5 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x60/0x80 _copy_from_user+0x20/0xb0 get_sg_io_hdr+0x9a/0xb0 scsi_cmd_ioctl+0x1ea/0x2f0 cdrom_ioctl+0x3c/0x12b4 sr_block_ioctl+0xa4/0xd0 block_ioctl+0x3f/0x50 ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (&cd->lock){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 sr_block_open+0xa2/0x180 __blkdev_get+0xdd/0x550 blkdev_get+0x38/0x150 do_dentry_open+0x16b/0x3e0 path_openat+0x3c9/0xa00 do_filp_open+0x75/0x100 do_sys_openat2+0x8a/0x140 __x64_sys_openat+0x46/0x70 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 __blkdev_get+0x6a/0x550 blkdev_get+0x85/0x150 blkdev_get_by_path+0x2c/0x70 btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] open_fs_devices+0x88/0x240 [btrfs] btrfs_open_devices+0x92/0xa0 [btrfs] btrfs_mount_root+0x250/0x490 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x119/0x380 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x8c6/0xca0 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_run_dev_stats+0x36/0x420 [btrfs] commit_cowonly_roots+0x91/0x2d0 [btrfs] btrfs_commit_transaction+0x4e6/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_commit_transaction+0x48e/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}: __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 __mutex_lock+0x7b/0x820 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_pagefaults); lock(&mm->mmap_lock#2); lock(sb_pagefaults); lock(&fs_info->reloc_mutex); *** DEADLOCK *** 3 locks held by systemd-journal/509: #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs] stack backtrace: CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x134/0x150 __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] __mutex_lock+0x7b/0x820 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? kvm_sched_clock_read+0x14/0x30 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0xc/0xb0 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] ? sched_clock+0x5/0x10 do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x7fa3972fdbfe Code: Bad RIP value. Fix this by not holding the ->device_list_mutex at this point. The device_list_mutex exists to protect us from modifying the device list while the file system is running. However it can also be modified by doing a scan on a device. But this action is specifically protected by the uuid_mutex, which we are holding here. We cannot race with opening at this point because we have the ->s_mount lock held during the mount. Not having the ->device_list_mutex here is perfectly safe as we're not going to change the devices at this point. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add some comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44d354ab |
|
12-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: review the call sites which can be interrupted by signal Since most metadata reservation calls can return -EINTR when get interrupted by fatal signal, we need to review the all the metadata reservation call sites. In relocation code, the metadata reservation happens in the following sites: - btrfs_block_rsv_refill() in merge_reloc_root() merge_reloc_root() is a pretty critical section, we don't want to be interrupted by signal, so change the flush status to BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal. Since such change can be ENPSPC-prone, also shrink the amount of metadata to reserve least amount avoid deadly ENOSPC there. - btrfs_block_rsv_refill() in reserve_metadata_space() It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted by signal. - btrfs_block_rsv_refill() in prepare_to_relocate() - btrfs_block_rsv_add() in prepare_to_relocate() - btrfs_block_rsv_refill() in relocate_block_group() - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster() - btrfs_start_transaction() in relocate_block_group() - btrfs_start_transaction() in create_reloc_inode() Can be interrupted by fatal signal and we can handle it easily. For these call sites, just catch the -EINTR value in btrfs_balance() and count them as canceled. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d85327b1 |
|
08-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: prefetch chunk tree leaves at mount The whole chunk tree is read at mount time so we can utilize readahead to get the tree blocks to memory before we read the items. The idea is from Robbie, but instead of updating search slot readahead, this patch implements the chunk tree readahead manually from nodes on level 1. We've decided to do specific readahead optimizations and then unify them under a common API so we don't break everything by changing the search slot readahead logic. Higher chunk trees grow on large filesystems (many terabytes), and prefetching just level 1 seems to be sufficient. Provided example was from a 200TiB filesystem with chunk tree level 2. CC: Robbie Ko <robbieko@synology.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
608769a4 |
|
02-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers Since btrfs_bio always contains the extra space for the tgtdev_map and raid_maps it's pointless to make the assignment iff specific conditions are met. Instead, always assign the pointers to their correct value at allocation time. To accommodate this change also move code a bit in __btrfs_map_block so that btrfs_bio::stripes array is always initialized before the raid_map, subsequently move the call to sort_parity_stripes in the 'if' building the raid_map, retaining the old behavior. To better understand the change, there are 2 aspects to this: 1. The original code is harder to grasp because the calculations for initializing raid_map/tgtdev ponters are apart from the initial allocation of memory. Having them predicated on 2 separate checks doesn't help that either... So by moving the initialisation in alloc_btrfs_bio puts everything together. 2. tgtdev/raid_maps are now always initialized despite sometimes they might be equal i.e __btrfs_map_block_for_discard calls alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated on external checks i.e. just because those pointers are non-null doesn't mean they are valid per-se. And actually while taking another look at __btrfs_map_block I saw a discrepancy: Original code initialised tgtdev_map if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) However, further down tgtdev_map is only used if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op)) e.g. the additional need_full_stripe(op) predicate is there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from mail discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3eee86c8 |
|
02-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: don't check for btrfs_device::bdev in btrfs_end_bio btrfs_map_bio ensures that all submitted bios to devices have valid btrfs_device::bdev so this check can be removed from btrfs_end_bio. This check was added in june 2012 597a60fadedf ("Btrfs: don't count I/O statistic read errors for missing devices") but then in October of the same year another commit de1ee92ac3bc ("Btrfs: recheck bio against block device when we map the bio") started checking for the presence of btrfs_device::bdev before actually issuing the bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c31efbdf |
|
03-Jul-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: record btrfs_device directly in btrfs_io_bio Instead of recording stripe_index and using that to access correct btrfs_device from btrfs_bio::stripes record the btrfs_device in btrfs_io_bio. This will enable endio handlers to increment device error counters on checksum errors. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3502a8c0 |
|
24-Jun-2020 |
David Sterba <dsterba@suse.com> |
btrfs: allow use of global block reserve for balance item deletion On a filesystem with exhausted metadata, but still enough to start balance, it's possible to hit this error: [324402.053842] BTRFS info (device loop0): 1 enospc errors during balance [324402.060769] BTRFS info (device loop0): balance: ended with status: -28 [324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left It fails inside reset_balance_state and turns the filesystem to read-only, which is unnecessary and should be fixed too, but the problem is caused by lack for space when the balance item is deleted. This is a one-time operation and from the same rank as unlink that is allowed to use the global block reserve. So do the same for the balance item. Status of the filesystem (100GiB) just after the balance fails: $ btrfs fi df mnt Data, single: total=80.01GiB, used=38.58GiB System, single: total=4.00MiB, used=16.00KiB Metadata, single: total=19.99GiB, used=19.48GiB GlobalReserve, single: total=512.00MiB, used=50.11MiB CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48cfa61b |
|
16-Jul-2020 |
Boris Burkov <boris@bur.io> |
btrfs: fix mount failure caused by race with umount It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail: CPU0 CPU1 down_write sb->s_umount btrfs_kill_super kill_anon_super(sb) generic_shutdown_super(sb); shrink_dcache_for_umount(sb); sync_filesystem(sb); evict_inodes(sb); // SLOW btrfs_mount_root btrfs_scan_one_device fs_devices = device->fs_devices fs_info->fs_devices = fs_devices // fs_devices-opened makes this a no-op btrfs_open_devices(fs_devices, mode, fs_type) s = sget(fs_type, test, set, flags, fs_info); find sb in s_instances grab_super(sb); down_write(&s->s_umount); // blocks sop->put_super(sb) // sb->fs_devices->opened == 2; no-op spin_lock(&sb_lock); hlist_del_init(&sb->s_instances); spin_unlock(&sb_lock); up_write(&sb->s_umount); return 0; retry lookup don't find sb in s_instances (deleted by CPU0) s = alloc_super return s; btrfs_fill_super(s, fs_devices, data) open_ctree // fs_devices total_rw_bytes improperly set! btrfs_read_chunk_tree read_one_dev // increment total_rw_bytes again!! super_total_bytes < fs_devices->total_rw_bytes // ERROR!!! To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree before the calls to read_one_dev, while holding the sb umount semaphore and the uuid mutex. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount. for i in $(seq 0 500) do dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1 done umount /mnt/foo& mount /mnt/foo does the trick for me. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ae3e715f |
|
13-May-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop stale reference to volume_mutex Commit dccdb07bc996 ("btrfs: kill btrfs_fs_info::volume_mutex") removed the last use of the volume_mutex, forgetting to update the comment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f551d96 |
|
04-May-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: free alien device after device add When an old device has new fsid through 'btrfs device add -f <dev>' our fs_devices list has an alien device in one of the fs_devices lists. By having an alien device in fs_devices, we have two issues so far 1. missing device does not not show as missing in the userland 2. degraded mount will fail Both issues are caused by the fact that there's an alien device in the fs_devices list. (Alien means that it does not belong to the filesystem, identified by fsid, or does not contain btrfs filesystem at all, eg. due to overwrite). A device can be scanned/added through the control device ioctls SCAN_DEV, DEVICES_READY or by ADD_DEV. And device coming through the control device is checked against the all other devices in the lists, but this was not the case for ADD_DEV. This patch fixes both issues above by removing the alien device. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
998a0671 |
|
04-May-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: include non-missing as a qualifier for the latest_bdev btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to the bdev with greatest device::generation number. For a typical-missing device the generation number is zero so fs_devices::latest_bdev will never point to it. But if the missing device is due to alienation [1], then device::generation is not zero and if it is greater or equal to the rest of device generations in the list, then fs_devices::latest_bdev ends up pointing to the missing device and reports the error like [2]. [1] We maintain devices of a fsid (as in fs_device::fsid) in the fs_devices::devices list, a device is considered as an alien device if its fsid does not match with the fs_device::fsid Consider a working filesystem with raid1: $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb $ mount /dev/sda /mnt-raid1 $ umount /mnt-raid1 While mnt-raid1 was unmounted the user force-adds one of its devices to another btrfs filesystem: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt-single $ btrfs dev add -f /dev/sda /mnt-single Now the original mnt-raid1 fails to mount in degraded mode, because fs_devices::latest_bdev is pointing to the alien device. $ mount -o degraded /dev/sdb /mnt-raid1 [2] mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing kernel: BTRFS error (device sdb): failed to read devices kernel: BTRFS error (device sdb): open_ctree failed Fix the root cause by checking if the device is not missing before it can be considered for the fs_devices::latest_bdev. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1ed802c9 |
|
28-Apr-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop useless goto in open_fs_devices There is no need of goto out in open_fs_devices() as there is nothing special done there. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b335eab8 |
|
15-Apr-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_read_disk_super return struct btrfs_disk_super Instead of returning both the page and the super block structure, make btrfs_read_disk_super just return a pointer to struct btrfs_disk_super. As a result the function signature is simplified. Also, read_cache_page_gfp can never return NULL so check its return value only for IS_ERR. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
726a3421 |
|
16-Feb-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: add error injection points for cancelling balance Introduce a new error injection point, should_cancel_balance(). It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but allows us to override the return value. Currently there are only one locations using this function: - btrfs_balance() It checks cancel before each block group. There are other locations checking fs_info->balance_cancel_req, but they are not used as an indicator to exit, so there is no need to use the wrapper. But there will be more locations coming, and some locations can cause kernel panic if not handled properly. So introduce this error injection to provide better test interface. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ba366c3 |
|
27-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: balance: factor out convert profile validation The validation follows the same steps for all three block group types, the existing helper validate_convert_profile can be enhanced and do more of the common things. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11c67b1a |
|
01-Mar-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Rename __btrfs_alloc_chunk to btrfs_alloc_chunk Having btrfs_alloc_chunk doesn't bring any value since it encapsulates a lockdep assert and a call to find_next_chunk. Simply rename the internal __btrfs_alloc_chunk function to the public one and remove it's 2nd parameter as all callers always pass the return value of find_next_chunk. Finally, migrate the call to lockdep_assert_held so as to not lose the check. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6aafb303 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: parameterize dev_extent_min for chunk allocation Currently, we ignore a device whose available space is less than "BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current allocation policy (to maximize the number of stripes). This commit parameterizes dev_extent_min, so that other policies can set their own lower limitat to ignore a device with insufficient space. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dce580ca |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: factor out create_chunk() Factor out create_chunk() from __btrfs_alloc_chunk(). This function finally creates a chunk. There is no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5badf512 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: factor out decide_stripe_size() Factor out decide_stripe_size() from __btrfs_alloc_chunk(). This function calculates the actual stripe size to allocate. decide_stripe_size() handles the common case to round down the 'ndevs' to 'devs_increment' and check the upper and lower limitation of 'ndevs'. decide_stripe_size_regular() decides the size of a stripe and the size of a chunk. The policy is to maximize the number of stripes. This commit has no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
560156cb |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: factor out gather_device_info() Factor out gather_device_info() from __btrfs_alloc_chunk(). This function iterates over devices list and gather information about devices. This commit also introduces "max_avail" and "dev_extent_min" to fold the same calculation to one variable. This commit has no functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
27c314d5 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: factor out init_alloc_chunk_ctl Factor out init_alloc_chunk_ctl() from __btrfs_alloc_chunk(). This function initialises parameters of "struct alloc_chunk_ctl" for allocation. init_alloc_chunk_ctl() handles a common part of the initialisation to load the RAID parameters from btrfs_raid_array. init_alloc_chunk_ctl_policy_regular() decides some parameters for its allocation. The last "else" case in the original code is moved to __btrfs_alloc_chunk() to handle the error case in the common code. Replace the BUG_ON with ASSERT() and error return at the same time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4f2bafe8 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: introduce alloc_chunk_ctl Introduce "struct alloc_chunk_ctl" to wrap needed parameters for the chunk allocation. This will be used to split __btrfs_alloc_chunk() into smaller functions. This commit folds a number of local variables in __btrfs_alloc_chunk() into one "struct alloc_chunk_ctl ctl". There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3b4ffa40 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: refactor find_free_dev_extent_start() Factor out two functions from find_free_dev_extent_start(). dev_extent_search_start() decides the starting position of the search. dev_extent_hole_check() checks if a hole found is suitable for device extent allocation. These functions also have the switch-cases to change the allocation behavior depending on the policy. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4a816c6 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: introduce chunk allocation policy Introduce chunk allocation policy for btrfs. This policy controls how chunks and device extents are allocated from devices. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b25c19f4 |
|
24-Feb-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: handle invalid profile in chunk allocation Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk(). Instead return -EINVAL with ASSERT() to catch a bug in the development stage. Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1db45a35 |
|
28-Nov-2019 |
David Sterba <dsterba@suse.com> |
btrfs: replace u_long type cast with unsigned long We don't use the u_XX types anywhere, though they're defined. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eeb6f172 |
|
28-Nov-2019 |
David Sterba <dsterba@suse.com> |
btrfs: raid56: simplify sort_parity_stripes Remove trivial comprator and open coded swap of two values. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
75fb2e9e |
|
02-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move mapping of block for discard to its caller There's a simple forwarded call based on the operation that would better fit the caller btrfs_map_block that's until now a trivial wrapper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c94bec2c |
|
14-Feb-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: bail out of uuid tree scanning if we're closing In doing my fsstress+EIO stress testing I started running into issues where umount would get stuck forever because the uuid checker was chewing through the thousands of subvolumes I had created. We shouldn't block umount on this, simply bail if we're unmounting the fs. We need to make sure we don't mark the UUID tree as ok, so we only set that bit if we made it through the whole rescan operation, but otherwise this is completely safe. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97f4dd09 |
|
18-Feb-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_check_uuid_tree private to disk-io.c It's used only during filesystem mount as such it can be made private to disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread as btrfs_check_uuid_tree is its sole caller. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
560b7a4a |
|
18-Feb-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate btrfs_uuid_tree_iterate is called from only once place and its 2nd argument is always btrfs_check_uuid_tree_entry. Simplify btrfs_uuid_tree_iterate's signature by removing its 2nd argument and directly calling btrfs_check_uuid_tree_entry. Also move the latter into uuid-tree.h. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8f32380d |
|
13-Feb-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: use the page cache for super block reading Super-block reading in BTRFS is done using buffer_heads. Buffer_heads have some drawbacks, like not being able to propagate errors from the lower layers. Directly use the page cache for reading the super blocks from disk or invalidating an on-disk super block. We have to use the page cache so to avoid races between mkfs and udev. See also 6f60cbd3ae44 ("btrfs: access superblock via pagecache in scan_one_device"). This patch unwraps the buffer head API and does not change the way the super block is actually read. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fbceb9f |
|
13-Feb-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: reduce scope of btrfs_scratch_superblocks() btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so remove it from the header file and mark it as static. Also move it above it's callers so we don't need a forward declaration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c514c9b1 |
|
13-Feb-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: don't kmap() pages from block devices Block device mappings are never in highmem so kmap() / kunmap() calls for pages from block devices are unneeded. Use page_address() instead of kmap() to get to the virtual addreses. While we're at it, read_cache_page_gfp() doesn't return NULL on error, only an ERR_PTR, so use IS_ERR() to check for errors. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f6d9abbc |
|
13-Feb-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Export btrfs_release_disk_super Preparatory patch for removal of buffer_head usage in btrfs. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f3cd2c58 |
|
12-Feb-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: sysfs, rename device_link add/remove functions Since commit 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes"), the functions btrfs_sysfs_add_device_link() and btrfs_sysfs_rm_device_link() do more than just adding and removing the device link as its name indicated. Rename them to be more specific that's about the directory with the attirbutes Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
00246528 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root We are now using these for all roots, rename them to btrfs_put_root() and btrfs_grab_root(); Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc44d7c4 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root Now that all callers of btrfs_get_fs_root are subsequently calling btrfs_grab_fs_root and handling dropping the ref when they are done appropriately, go ahead and push btrfs_grab_fs_root up into btrfs_get_fs_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fbb0ce40 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry We lookup the uuid of arbitrary subvolumes, hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3619c94f |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: open code btrfs_read_fs_root_no_name All this does is call btrfs_get_fs_root() with check_ref == true. Just use btrfs_get_fs_root() so we don't have a bunch of different helpers that do the same thing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1362089d |
|
10-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix split-brain handling when changing FSID to metadata uuid Current code doesn't correctly handle the situation which arises when a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID changed to the one in metadata uuid. This causes the incompat flag to disappear. In case of a power failure we could end up in a situation where part of the disks in a multi-disk filesystem are correctly reverted to METADATA_UUID_INCOMPAT flag unset state, while others have METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS. This patch corrects the behavior required to handle the case where a disk of the second type is scanned first, creating the necessary btrfs_fs_devices. Subsequently, when a disk which has already completed the transition is scanned it should overwrite the data in btrfs_fs_devices. Reported-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05840710 |
|
10-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Handle another split brain scenario with metadata uuid feature There is one more cases which isn't handled by the original metadata uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and the user decides to change the FSID to the original one e.g. have metadata_uuid and fsid match. In case of power failure while this operation is in progress we could end up in a situation where some of the disks have the incompat bit removed and the other half have both METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags. This patch handles the case where a disk that has successfully changed its FSID such that it equals METADATA_UUID is scanned first. Subsequently when a disk with both METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned find_fsid_changed won't be able to find an appropriate btrfs_fs_devices. This is done by extending find_fsid_changed to correctly find btrfs_fs_devices whose metadata_uuid/fsid are the same and they match the metadata_uuid of the currently scanned device. Fixes: cc5de4e70256 ("btrfs: Handle final split-brain possibility during fsid change") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reported-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6730a0e |
|
10-Jan-2020 |
Su Yue <Damenly_Su@gmx.com> |
btrfs: Factor out metadata_uuid code from find_fsid. find_fsid became rather hairy with the introduction of metadata uuid changing feature. Alleviate this by factoring out the metadata uuid specific code in a dedicated function which deals with finding correct fsid for a device with changed uuid. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c0d81c7c |
|
10-Jan-2020 |
Su Yue <Damenly_Su@gmx.com> |
btrfs: Call find_fsid from find_fsid_inprogress Since find_fsid_inprogress should also handle the case in which an fs didn't change its FSID make it call find_fsid directly. This makes the code in device_list_add simpler by eliminating a conditional call of find_fsid. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
96a14336 |
|
10-Dec-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Move and unexport btrfs_rmap_block It's used only during initial block group reading to map physical address of super block to a list of logical ones. Make it private to block-group.c, add proper kernel doc and ensure it's exported only for tests. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a69976bc |
|
09-Jan-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: device stats, log when stats are zeroed We had a report indicating that some read errors aren't reported by the device stats in the userland. It is important to have the errors reported in the device stat as user land scripts might depend on it to take the reasonable corrective actions. But to debug these issue we need to be really sure that request to reset the device stat did not come from the userland itself. So log an info message when device error reset happens. For example: BTRFS info (device sdc): device stats zeroed by btrfs(9223) Reported-by: philip@philip-seeger.de Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b0643e59 |
|
13-Dec-2019 |
Dennis Zhou <dennis@kernel.org> |
btrfs: add the beginning of async discard, discard workqueue When discard is enabled, everytime a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations. This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series. For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one in the same. On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) --------------------------------------------------------------- Drive A | 434 | 1170 Drive B | 880 | 2330 Drive C | 2943 | 3920 Drive D | 4763 | 5701 discard=async: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) -------------------------------------------------------------- Drive A | 134 | 956 Drive B | 64 | 1972 Drive C | 59 | 1032 Drive D | 62 | 1200 While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and and the delta between the two are substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
321f69f8 |
|
04-Dec-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: reset device back to allocation state when removing When closing a device, btrfs_close_one_device() first allocates a new device, copies the device to close's name, replaces it in the dev_list with the copy and then finally frees it. This involves two memory allocation, which can potentially fail. As this code path is tricky to unwind, the allocation failures where handled by BUG_ON()s. But this copying isn't strictly needed, all that is needed is resetting the device in question to it's state it had after the allocation. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3fff3975 |
|
26-Nov-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: decrement number of open devices after closing the device not before In btrfs_close_one_device we're decrementing the number of open devices before we're calling btrfs_close_bdev(). As there is no intermediate exit between these points in this function it is technically OK to do so, but it makes the code a bit harder to understand. Move both operations closer together and move the decrement step after btrfs_close_bdev(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b38f4cbd |
|
05-Dec-2019 |
Johannes Thumshirn <jth@kernel.org> |
btrfs: remove impossible WARN_ON in btrfs_destroy_dev_replace_tgtdev() We have a user report, that cppcheck is complaining about a possible NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev(). We're first dereferencing the 'tgtdev' variable and the later check for the validity of the pointer with a WARN_ON(!tgtdev); But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the WARN_ON() is impossible to hit. Just remove it to silence the checker's complains. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
db26a024 |
|
25-Nov-2019 |
David Sterba <dsterba@suse.com> |
btrfs: fill ncopies for all raid table entries Make the number of copies explicit even for entries that use the default 0 value for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4f6c6be |
|
13-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use raid_attr table in calc_stripe_length for nparity The table is already used for ncopies, replace open coding of stripes with the raid_attr value. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b35cf1f0 |
|
10-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: check rw_devices, not num_devices for balance The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf93e15e |
|
27-Nov-2019 |
David Sterba <dsterba@suse.com> |
btrfs: fix devs_max constraints for raid1c3 and raid1c4 The value 0 for devs_max means to spread the allocated chunks over all available devices, eg. stripe for RAID0 or RAID5. This got mistakenly copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and 4 copies respectively. Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)") Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f0432d0 |
|
13-Nov-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: change btrfs_fs_devices::rotating to bool struct btrfs_fs_devices::rotating currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0395d84f |
|
13-Nov-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: change btrfs_fs_devices::seeding to bool struct btrfs_fs_devices::seeding currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32da5386 |
|
29-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cfbb825c |
|
10-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add incompat for raid1 with 3, 4 copies The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8d6fac00 |
|
02-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add support for 4-copy replication (raid1c4) Add new block group profile to store 4 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1. The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47e6f742 |
|
02-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add support for 3-copy replication (raid1c3) Add new block group profile to store 3 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1. The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b7faadd |
|
23-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Ensure we trim ranges across block group boundary [BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len: Logical length of this trim trimmed: Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() |- btrfs_finish_extent_commit() |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); |- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: |<--- BG1 --->|<--- BG2 --->| |//////|<-- Range to discard --->|/////| To discard that range, we have the following calls: btrfs_discard_extent() |- btrfs_map_block() | Returned bbio will end at BG1's end. As btrfs_map_block() | never returns result across block group boundary. |- btrfs_issuse_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length paramter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifing the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2d974619 |
|
23-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Use more straightforward way to calculate map length The old code goes: offset = logical - em->start; length = min_t(u64, em->len - offset, length); Where @length calculation is dependent on offset, it can take reader several more seconds to find it's just the same code as: offset = logical - em->start; length = min_t(u64, em->start + em->len - logical, length); Use above code to make the length calculate independent from other variable, thus slightly increase the readability. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3470b5d |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add dedicated members for start and length of a block group The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes thing harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf38be65 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move block_group_item::used to block group For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree. The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32ab3d1b |
|
18-Oct-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: remove pointless indentation in btrfs_read_sys_array() Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and then process it we could just bail out early if the read disk key wasn't a BTRFS_CHUNK_ITEM_KEY. This removes a level of indentation and makes the code nicer to read. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ae21692 |
|
18-Oct-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: reduce indentation in btrfs_may_alloc_data_chunk In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of type BTRFS_BLOCK_GROUP_DATA and if it is we process it. Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk we can negate the check and bail out early if it isn't. This makes the code a bit more readable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba8a9d07 |
|
10-Jul-2019 |
Chris Mason <clm@fb.com> |
Btrfs: delete the entire async bio submission framework Now that we're not using btrfs_schedule_bio() anymore, delete all the code that supported it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08635bae |
|
10-Jul-2019 |
Chris Mason <clm@fb.com> |
Btrfs: stop using btrfs_schedule_bio() btrfs_schedule_bio() hands IO off to a helper thread to do the actual submit_bio() call. This has been used to make sure async crc and compression helpers don't get stuck on IO submission. To maintain good performance, over time the IO submission threads duplicated some IO scheduler characteristics such as high and low priority IOs and they also made some ugly assumptions about request allocation batch sizes. All of this cost at least one extra context switch during IO submission, and doesn't fit well with the modern blkmq IO stack. So, this commit stops using btrfs_schedule_bio(). We may need to adjust the number of async helper threads for crcs and compression, but long term it's a better path. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4143cb8b |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add const function attribute For some reason the attribute is called __attribute_const__ and not __const, marks functions that have no observable effects on program state, IOW not reading pointers, just the arguments and calculating a value. Allows the compiler to do some optimizations, based on -Wsuggest-attribute=const . The effects are rather small, though, about 60 bytes decrese of btrfs.ko. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b105e927 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add __cold attribute to more functions The attribute can mark functions supposed to be called rarely if at all and the text can be moved to sections far from the other code. The attribute has been added to several functions already, this patch is based on hints given by gcc -Wsuggest-attribute=cold. The net effect of this patch is decrease of btrfs.ko by 1000-1300, depending on the config options. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aa6c0df7 |
|
02-Oct-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: print process name and pid that calls device scanning Its very helpful if we had logged the device scanner process name to debug the race condition between the systemd-udevd scan and the user initiated device forget command. This patch adds process name and pid to the scan message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add pid to the message ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1499166 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use has_single_bit_set for clarity Replace is_power_of_2 with the helper that is self-documenting and remove the open coded call in alloc_profile_is_valid. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e62869be |
|
25-Sep-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: balance: use term redundancy instead of integrity in message When balance reduces the number of copies of metadata, it reduces the redundancy, use the term redundancy instead of integrity. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0cac0ec |
|
16-Sep-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: get rid of unique workqueue helper functions Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c17add7a |
|
27-Aug-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Consider system chunk array size for new SYSTEM chunks For SYSTEM chunks, despite the regular chunk item size limit, there is another limit due to system chunk array size. The extra limit was removed in a refactoring, so add it back. Fixes: e3ecdb3fdecf ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a547890 |
|
12-Sep-2019 |
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> |
btrfs: fix balance convert to single on 32-bit host CPUs Currently, the command: btrfs balance start -dconvert=single,soft . on a Raspberry Pi produces the following kernel message: BTRFS error (device mmcblk0p2): balance: invalid convert data profile single This fails because we use is_power_of_2(unsigned long) to validate the new data profile, the constant for 'single' profile uses bit 48, and there are only 32 bits in a long on ARM. Fix by open-coding the check using u64 variables. Tested by completing the original balance command on several Raspberry Pis. Fixes: 818255feece6 ("btrfs: use common helper instead of open coding a bit test") CC: stable@vger.kernel.org # 4.20+ Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fab27359 |
|
24-Sep-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Fix a regression which we can't convert to SINGLE profile [BUG] With v5.3 kernel, we can't convert to SINGLE profile: # btrfs balance start -f -dconvert=single $mnt ERROR: error during balancing '/mnt/btrfs': Invalid argument # dmesg -t | tail validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1 BTRFS error (device dm-3): balance: invalid convert data profile single [CAUSE] With the extra debug output added, it shows that the @allowed bit is lacking the special in-memory only SINGLE profile bit. Thus we fail at that (profile & ~allowed) check. This regression is caused by commit 081db89b13cb ("btrfs: use raid_attr to get allowed profiles for balance conversion") and the fact that we don't use any bit to indicate SINGLE profile on-disk, but uses special in-memory only bit to help distinguish different profiles. [FIX] Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be the same as it was and fix the regression. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 081db89b13cb ("btrfs: use raid_attr to get allowed profiles for balance conversion") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1dc990df |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move dev_stats helpers to volumes.c The other dev stats functions are already there and the helpers are not used by anything else. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
784352fe |
|
21-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move math functions to misc.h Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d2979aa2 |
|
27-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use proper error values on allocation failure in clone_fs_devices Fix the fake ENOMEM return error code to the actual error in clone_fs_devices(). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a06dee4d |
|
27-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: proper error handling when invalid device is found in find_next_devid In a corrupted tree, if search for next devid finds the device with devid = -1, then report the error -EUCLEAN back to the parent function to fail gracefully. The tree checker will not catch this in case the devids are created using the following script: umount /btrfs dev1=/dev/sdb dev2=/dev/sdc mkfs.btrfs -fq -dsingle -msingle $dev1 mount $dev1 /btrfs _fail() { echo $1 exit 1 } while true; do btrfs dev add -f $dev2 /btrfs || _fail "add failed" btrfs dev del $dev1 /btrfs || _fail "del failed" dev_tmp=$dev1 dev1=$dev2 dev2=$dev_tmp done With output: BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506] BTRFS error (device sdb): block=313739198464 write time tree block corruption detected BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction) BTRFS warning (device sdb): Skipping commit of aborted transaction. BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> [ add script and messages ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f1136989 |
|
09-Aug-2019 |
Dan Carpenter <dan.carpenter@oracle.com> |
btrfs: fix error pointer check in __btrfs_map_block() The btrfs_get_chunk_map() never returns NULL, it returns error pointers. Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3b80a984 |
|
21-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: dev stat drop useless goto In the function btrfs_init_dev_stats() goto out is not needed, because the alloc has failed. So just return -ENOMEM. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
440630ea |
|
21-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: dev stats item key conversion per cpu type is not needed %found_key is not used, drop it since it hasn't been used since the beginning in 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit"). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f93c3997 |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: factor out sysfs code for updating sprout fsid Wrap the fsid renaming code and move it to sysfs.c. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5b28692e |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: factor out sysfs code for sending device uevent The device uevent belongs to the sysfs API. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ae4b9b4c |
|
07-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: opencode reset of all device stats __btrfs_reset_dev_stats() is a small helper function to reset devices stat values, and is used only once, instead just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e411a7d |
|
07-Aug-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: reset device stat using btrfs_dev_stat_set btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch open codes btrfs_dev_stat_reset(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aac0023c |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move basic block_group definitions to their own header This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
112974d4 |
|
19-Jul-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate() [BUG] Test case btrfs/156 fails since commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") with ENOSPC. [CAUSE] The ENOSPC is reported from btrfs_can_relocate(). This function will check: - If this block group is empty, we can relocate - If we can enough free space, we can relocate Above checks are valid but the following check is vague due to its implementation: - If and only if we can allocated a new block group to contain all the used space, we can relocate This design itself is OK, but the way to determine if we can allocate a new block group is problematic. btrfs_can_relocate() uses find_free_dev_extent() to find free space on a device. However find_free_dev_extent() only searches commit root and excludes dev extents allocated in current trans, this makes it unable to use dev extent just freed in current transaction. So for the following example, btrfs_can_relocate() will report ENOSPC: The example block group layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | // = Used bg, consider all bg is 100% used for easy calculation. And all block groups are SINGLE, on-disk bytenr is the same as the logical bytenr. 1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100 1M 129M 257M 385M 513M 550M |///////| |//////////|/////////| In transid 100, bg in [129M, 257M) get relocated to [385M, 513M) However transid 100 is not committed yet, so in dev commit tree, we still have the old dev extents layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | 2) Try to relocate bg [257M, 385M) We goes into btrfs_can_relocate(), no free space in current bgs, so we check if we can find large enough free dev extents. The first slot is [385M, 513M), but that is already used by new bg at [385M, 513M), so we continue search. The remaining slot is [512M, 550M), smaller than the bg's length 128M. So btrfs_can_relocate report ENOSPC. However this is over killed, in fact if we just skip btrfs_can_relocate() check, and go into regular relocation routine, at extent reservation time, if we can't find free extent, then we fallback to commit transaction, which will free up the dev extents and allow new block group to be created. [FIX] The fix here is to remove btrfs_can_relocate() completely. If we hit the false ENOSPC case just like btrfs/156, extent allocator will push harder by committing transaction and we will have space for new block group, avoiding the false ENOSPC. If we really ran out of space, we will hit ENOSPC at relocate_block_group(), and btrfs will just reports the ENOSPC error as usual. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
135da976 |
|
19-Jul-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Add comment for find_free_dev_extent_start() Since commit 6df9a95e6339 ("Btrfs: make the chunk allocator completely tree lockless") we search commit root of device tree to avoid deadlock. This introduced a safety feature, find_free_dev_extent_start() won't use dev extents which just get freed in current transaction. This safety feature makes sure we won't allocate new block group using just freed dev extents to break CoW. However, this feature also makes find_free_dev_extent_start() not reliable reporting free device space. Just add such comment to make later viewer careful about this behavior. This behavior makes one caller, btrfs_can_relocate() unreliable determining the device free space. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9e3246a5 |
|
19-Jul-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Unexport find_free_dev_extent_start() This function is only used locally in find_free_dev_extent(), no external callers. So unexport it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
99fccf33 |
|
02-Jul-2019 |
YueHaibing <yuehaibing@huawei.com> |
btrfs: remove set but not used variable 'offset' Fixes gcc '-Wunused-but-set-variable' warning: fs/btrfs/volumes.c: In function __btrfs_map_block: fs/btrfs/volumes.c:6023:6: warning: variable offset set but not used [-Wunused-but-set-variable] It is not used any more since commit 343abd1c0ca9 ("btrfs: Use btrfs_get_io_geometry appropriately") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7cd4dd9 |
|
07-Aug-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix sysfs warning and missing raid sysfs directories In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim during contextes where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contextes where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, we can have new raid kobject created in other two cases at least: 1) During device replace (or scrub) after adding a device a to the filesystem. The replace procedure (and scrub) do calls to btrfs_inc_block_group_ro() which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount trough any write path. This can be triggered by test case btrfs/124. Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only makes things more complex and fragile, can also introduce deadlocks with reclaim the following way: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contextes are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and setup a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
373c3b80 |
|
15-Jul-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: don't leak extent_map in btrfs_get_io_geometry() btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference on a extent_map, but on normal operation it does not drop this reference anymore. This leads to excessive kmemleak reports. Always call free_extent_map(), not just in the error case. Fixes: 5f1411265e16 ("btrfs: Introduce btrfs_io_geometry infrastructure") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8719aaae |
|
18-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move space_info to space-info.h Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
89b798ad |
|
02-Jun-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Use btrfs_get_io_geometry appropriately Presently btrfs_map_block is used not only to do everything necessary to map a bio to the underlying allocation profile but it's also used to identify how much data could be written based on btrfs' stripe logic without actually submitting anything. This is achieved by passing NULL for 'bbio_ret' parameter. This patch refactors all callers that require just the mapping length by switching them to using btrfs_io_geometry instead of calling btrfs_map_block with a special NULL value for 'bbio_ret'. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5f141126 |
|
02-Jun-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Introduce btrfs_io_geometry infrastructure Add a structure that holds various parameters for IO calculations and a helper that fills the values. This will help further refactoring and reduction of functions that in some way open-coded the calculations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9e967495 |
|
22-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: prevent send failures and crashes due to concurrent relocation Send always operates on read-only trees and always expected that while it is in progress, nothing changes in those trees. Due to that expectation and the fact that send is a read-only operation, it operates on commit roots and does not hold transaction handles. However relocation can COW nodes and leafs from read-only trees, which can cause unexpected failures and crashes (hitting BUG_ONs). while send using a node/leaf, it gets COWed, the transaction used to COW it is committed, a new transaction starts, the extent previously used for that node/leaf gets allocated, possibly for another tree, and the respective extent buffer' content changes while send is still using it. When this happens send normally fails with EIO being returned to user space and messages like the following are found in dmesg/syslog: [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253 [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984 Other times, less often, we hit a BUG_ON() because an extent buffer that send is using used to be a node, and while send is still using it, it got COWed and got reused as a leaf while send is still using, producing the following trace: [ 3478.466280] ------------[ cut here ]------------ [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806! [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000 [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3478.479003] Call Trace: [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0 [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs] [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs] [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs] [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs] [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs] [ 3478.483581] ? rq_clock_task+0x2e/0x60 [ 3478.484086] ? wake_up_new_task+0x1f3/0x370 [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0 [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0 [ 3478.486016] ? __fget+0x113/0x200 [ 3478.486467] ksys_ioctl+0x70/0x80 [ 3478.486911] __x64_sys_ioctl+0x16/0x20 [ 3478.487337] do_syscall_64+0x60/0x1b0 [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7 (...) [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001 (...) [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]--- Another possibility, much less likely to happen, is that send will not fail but the contents of the stream it produces may not be correct. To avoid this, do not allow send and relocation (balance) to run in parallel. In the long term the goal is to allow for both to be able to run concurrently without any problems, but that will take a significant effort in development and testing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7369b3f |
|
31-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add mask for all RAID1 types Preparatory patch for additional RAID1 profiles with more copies. The mask will contain 3-copy and 4-copy, most of the checks for plain RAID1 work the same for the other profiles. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b6f5d40 |
|
09-May-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Add comments on locking of several device-related fields Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cff82672 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: read number of data stripes from map only once There are several places that call nr_data_stripes, but this value does not change. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
158da513 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: refactor helper for bg flags to name conversion The helper lacks the btrfs_ prefix and the parameter is the raw blockgroup type, so none of the callers has to do the flags -> index conversion. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3ecdb3f |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: factor out devs_max setting in __btrfs_alloc_chunk Merge the repeated code before the if-else block. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
946c9256 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: factor out helper for counting data stripes Factor the sequence of ifs to a helper, the 'data stripes' here means the number of stripes without redundancy and parity. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44b28ada |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use raid_attr table for btrfs_bg_type_to_factor The factor is the number of copies. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6079e12c |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use raid_attr table to find profiles for integrity lowering Replace open coded list of the profiles by selecting them from the raid_attr table. The criteria are now more explicit, we need profiles that have more than 1 copy of the data or can reconstruct the data with a missing device. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
081db89b |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use raid_attr to get allowed profiles for balance conversion Iterate over the table and gather all allowed profiles for a given number of devices, instead of open coding. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc9a2ac7 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: use raid_attr in btrfs_chunk_max_errors The number of tolerated failures is stored in the raid_attr table, use it. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8bf1b67 |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove mapping tree structures indirection fs_info::mapping_tree is the physical<->logical mapping tree and uses the same underlying structure as extents, but is embedded to another structure. There are no other members and this indirection is useless. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49cc180c |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: raid56: allow the exact minimum number of devices for balance convert The minimum number of devices for RAID5 is 2, though this is only a bit expensive RAID1, and for RAID6 it's 3, which is a triple copy that works only 3 devices. mkfs.btrfs allows that and mounting such filesystem also works, so the conversion via balance filters is inconsistent with the others and we should not prevent it. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ee5f8ae |
|
17-May-2019 |
David Sterba <dsterba@suse.com> |
btrfs: fix minimum number of chunk errors for DUP The list of profiles in btrfs_chunk_max_errors lists DUP as a profile DUP able to tolerate 1 device missing. Though this profile is special with 2 copies, it still needs the device, unlike the others. Looking at the history of changes, thre's no clear reason why DUP is there, functions were refactored and blocks of code merged to one helper. d20983b40e828 Btrfs: fix writing data into the seed filesystem - factor code to a helper de11cc12df173 Btrfs: don't pre-allocate btrfs bio - unrelated change, DUP still in the list with max errors 1 a236aed14ccb0 Btrfs: Deal with failed writes in mirrored configurations - introduced the max errors, leaves DUP and RAID1 in the same group Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
65237ee3 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
163e97ee |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from device in btrfs_scrub_cancel_dev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f331a952 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from device in btrfs_rm_dev_item We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
196c9d8d |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in btrfs_run_dev_stats We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c466629 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in btrfs_finish_sprout We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6f8e0fc7 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in init_first_rw_device We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b7a2440 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from trans in btrfs_create_tree We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
17850759 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in read_one_dev We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9690ac09 |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in read_one_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ddaf1d5a |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in btrfs_check_chunk_valid We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ec0896c |
|
20-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: get fs_info from eb in should_balance_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e74e3993 |
|
27-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Factor out in_range macro This is used in more than one places so let's factor it out in ctree.h. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
60dfdf25 |
|
27-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Now that these functions no longer require a handle to transaction to inspect pending/pinned chunks the argument can be removed. At the same time also remove any surrounding code which acquired the handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c11b63e |
|
27-Mar-2019 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: replace pending/pinned chunks lists with io tree The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e75fd89 |
|
27-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Stop using call_rcu for device freeing btrfs_device structs are freed from RCU context since device iteration is protected by RCU. Currently this is achieved by using call_rcu since no blocking functions are called within btrfs_free_device. Future refactoring of pending/pinned chunks will require calling sleeping functions. This patch is in preparation for these changes by simply switching from RCU callbacks to explicit calls of synchronize_rcu and calling btrfs_free_device directly. This is functionally equivalent, making sure that there are no readers at that time. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39e264a4 |
|
25-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Populate ->orig_block_len during read_one_chunk Chunks read from disk currently don't get their ->orig_block_len member set, in contrast when a new chunk is allocated, the respective extent_map's ->orig_block_len is assigned the size of the stripe of this chunk. Let's apply the same strategy for chunks which are read from disk, not only does this codify the invariant that ->orig_block_len always contains the size of the stripe for a chunk (when the em belongs to the mapping tree). But it's also a preparatory patch for further work around tracking chunk allocation in an extent tree rather than pinned/pending lists. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
61d0d0d2 |
|
25-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink During device shrink pinned/pending chunks (i.e. those which have been deleted/created respectively, in the current transaction and haven't touched disk) need to be accounted when doing device shrink. Presently this happens after the main relocation loop in btrfs_shrink_device, which could lead to making another go in the body of the function. Since there is no hard requirement to perform pinned/pending chunks handling after the relocation loop, move the code before it. This leads to simplifying the code flow around - i.e. no need to use 'goto again'. A notable side effect of this change is that modification of the device's size requires a transaction to be started and committed before the relocation loop starts. This is necessary to ensure that relocation process sees the shrunk device size. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bbbf7243 |
|
25-Mar-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: combine device update operations during transaction commit We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose. We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab4ba2e1 |
|
07-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: Verify dev item [BUG] For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then kernel will just panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 #PF error: [normal kernel read fault] PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0 Call Trace: open_ctree+0x160d/0x2149 btrfs_mount_root+0x5b2/0x680 [CAUSE] If device extent verification finds a deivce with 0 total_bytes, then it assumes it's a seed dummy, then search for seed devices. But in this case, there is no seed device at all, causing NULL pointer. [FIX] Since this is caused by fuzzed image, let's go the tree-check way, just add a new verification for device item. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
075cb3c7 |
|
19-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: Check chunk item at tree block read time Since we have btrfs_check_chunk_valid() in tree-checker, let's do chunk item verification in tree-checker too. Since the tree-checker is run at endio time, if one chunk leaf fails chunk verification, we can still retry the other copy, making btrfs more robust to fuzzed image as we may still get a good chunk item. Also since we have done chunk verification in tree block read time, skip the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading chunk items from leaf. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82fc28fb |
|
19-Mar-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export it By function, chunk item verification is more suitable to be done inside tree-checker. So move btrfs_check_chunk_valid() to tree-checker.c and export it. And since it's now moved to tree-checker, also add a better comment for what this function is doing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
290342f6 |
|
25-Mar-2019 |
Arnd Bergmann <arnd@arndb.de> |
btrfs: use BUG() instead of BUG_ON(1) BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here max_chunk_size); ^~~~~~~~~~~~~~ include/linux/kernel.h:860:36: note: expanded from macro 'min' #define min(x, y) __careful_cmp(x, y, <) ^ include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp' __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op)) ^ include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once' typeof(y) unique_y = (y); \ ^ fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^ fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning u64 max_chunk_size; ^ = 0 Change it to BUG() so clang can see that this code path can never continue. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0cc068e6 |
|
07-Mar-2019 |
David Sterba <dsterba@suse.com> |
btrfs: don't report readahead errors and don't update statistics As readahead is an optimization, all errors are usually filtered out, but still properly handled when the real read call is done. The commit 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") added REQ_RAHEAD to readpages() because that's only used for readahead (despite what one would expect from the callback name). This causes a flood of messages and inflated read error stats, so skip reporting in case it's readahead. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403 Reported-by: LimeTech <tomm@lime-technology.com> Fixes: 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com>
|
#
349ae63f |
|
18-Feb-2019 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: ensure that a DUP or RAID1 block group has exactly two stripes We recently had a customer issue with a corrupted filesystem. When trying to mount this image btrfs panicked with a division by zero in calc_stripe_length(). The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length() takes this value and divides it by the number of copies the RAID profile is expected to have to calculate the amount of data stripes. As a DUP profile is expected to have 2 copies this division resulted in 1/2 = 0. Later then the 'data_stripes' variable is used as a divisor in the stripe length calculation which results in a division by 0 and thus a kernel panic. When encountering a filesystem with a DUP block group and a 'num_stripes' value unequal to 2, refuse mounting as the image is corrupted and will lead to unexpected behaviour. Code inspection showed a RAID1 block group has the same issues. Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7faad6e2 |
|
08-Feb-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix comment its device list mutex not volume lock We have killed volume mutex (commit: dccdb07bc996 btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have escaped. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
228a73ab |
|
03-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: introduce new ioctl to unregister a btrfs device Support for a new command that can be used eg. as a command $ btrfs device scan --forget [dev]' (the final name may change though) to undo the effects of 'btrfs device scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty and is next to the SCAN ioctl. The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device (/dev/btrfs-control) to unregister one or all devices, devices that are not mounted. The argument is struct btrfs_ioctl_vol_args, ::name specifies the device path. To unregister all device, the path is an empty string. Again, the devices are removed only if they aren't part of a mounte filesystem. This new ioctl provides: - release of unwanted btrfs_fs_devices and btrfs_devices structures from memory if the device is not going to be mounted - ability to mount filesystem in degraded mode, when one devices is corrupted like in split brain raid1 - running test cases which would require reloading the kernel module but this is not possible eg. due to mounted filesystem or built-in Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
09ba3bc9 |
|
18-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: merge btrfs_find_device and find_device Both btrfs_find_device() and find_device() does the same thing except that the latter does not take the seed device onto account in the device scanning context. We can merge them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70bc7088 |
|
03-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_free_stale_devices() to get return value Preparatory patch to add ioctl that allows to forget a device (ie. reverse of scan). Refactors btrfs_free_stale_devices() to obtain return status. As this function can fail if it can't find the given path (returns -ENOENT) or trying to delete a mounted device (returns -EBUSY). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4319cd9 |
|
17-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_find_device() take fs_devices as argument btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info. Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6e927ceb |
|
17-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup btrfs_find_device_by_devspec() btrfs_find_device_by_devspec() finds the device by @devid or by @device_path. This patch makes code flow easy to read by open coding the else part and renames devpath to device_path. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d95a830c |
|
17-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: merge btrfs_find_device_missing_or_by_path() into parent btrfs_find_device_missing_or_by_path() is relatively small function, and its only parent btrfs_find_device_by_devspec() is small as well. Besides there are a number of find_device functions. Merge btrfs_find_device_missing_or_by_path() into its parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92900e51 |
|
26-Jan-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: fix potential oops in device_list_add alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its result before the check for IS_ERR() is a bad idea. Fixes: d1a63002829a4 ("btrfs: add members to fs_devices to track fsid changes") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b3922a8 |
|
07-Jan-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Use real device structure to verify dev extent [BUG] Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel message: BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0 BTRFS error (device dm-6): failed to verify dev extents against chunks: -117 BTRFS error (device dm-6): open_ctree failed [CAUSE] Commit cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") introduced strict check on dev extents. We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and only dependent on @devid to find the real device. For seed devices, we call clone_fs_devices() in open_seed_devices() to allow us search seed devices directly. However clone_fs_devices() just populates devices with devid and dev uuid, without populating other essential members, like disk_total_bytes. This makes any device returned by btrfs_find_device(fs_info, devid, NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev extents on the seed device will not pass the device boundary check. [FIX] This patch will try to verify the device returned by btrfs_find_device() and if it's a dummy then re-search in seed devices. Fixes: cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") CC: stable@vger.kernel.org # 4.19+ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
52042d8e |
|
27-Nov-2018 |
Andrea Gelmini <andrea.gelmini@gelma.net> |
btrfs: Fix typos in comments and strings The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
15c82763 |
|
05-Dec-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove 1st shrink/grow phase from balance The first step of the rebalance process ensures there is 1MiB free on each device. This number seems rather small. And in fact when talking to the original authors their opinions were: "man that's a little bonkers" "i don't think we even need that code anymore" "I think it was there to make sure we had room for the blank 1M at the beginning. I bet it goes all the way back to v0" "we just don't need any of that tho, i say we just delete it" Clearly, this piece of code has lost its original intent throughout the years. It doesn't really bring any real practical benefits to the relocation process. Additionally, this patch makes the balance process more lightweight by removing a pair of shrink/grow operations which are rather expensive for heavily populated filesystems. This is mainly due to shrink requiring relocating block groups, involving heavy use of the btree. The intermediate shrink/grow can fail and leave the filesystem in a middle state that would need to be changed back by the user. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7073017a |
|
05-Dec-2018 |
Johannes Thumshirn <jthumshirn@suse.de> |
btrfs: use offset_in_page instead of open-coding it Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an offset into a page. So replace them by the offset_in_page() macro instead of open-coding it if they're not used as an alignment check. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb5583dd |
|
07-Sep-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: open code trivial locking helpers The dev-replace locking functions are now trivial wrappers around rw semaphore that can be used directly everywhere. No functional change. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
53176dde |
|
04-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: dev-replace: remove custom read/write blocking scheme After the rw semaphore has been added, the custom blocking using ::blocking_readers and ::read_lock_wq is redundant. The blocking logic in __btrfs_map_block is replaced by extending the time the semaphore is held, that has the same blocking effect on writes as the previous custom scheme that waited until ::blocking_readers was zero. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da12fe54 |
|
27-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Refactor btrfs_merge_bio_hook This function really checks whether adding more data to the bio will straddle a stripe/chunk. So first let's give it a more appropraite name - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never used to just remove it. Thirdly, pages are submitted to either btree or data inodes so it's guaranteed that tree->ops is set so replace the check with an ASSERT. Finally, document the parameters of the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7333bd02 |
|
20-Nov-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: balance: print to system log when balance ends or is paused Print a kernel log message when the balance ends, either for cancel or completed or if it is paused. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
56fc37d9 |
|
20-Nov-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: balance: print args during start and resume The information about balance arguments is important for system audit, this patch prints the textual representation when balance starts or is resumed. Example command: $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs Example kernel log output: BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, simplify code ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f89e09cf |
|
20-Nov-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add helper to describe block group flags Factor out helper that describes block group flags from describe_relocation. The result will not be longer than the given size. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a8067c0 |
|
19-Nov-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix access to available allocation bits when starting balance The available allocation bits members from struct btrfs_fs_info are protected by a sequence lock, and when starting balance we access them incorrectly in two different ways: 1) In the read sequence lock loop at btrfs_balance() we use the values we read from fs_info->avail_*_alloc_bits and we can immediately do actions that have side effects and can not be undone (printing a message and jumping to a label). This is wrong because a retry might be needed, so our actions must not have side effects and must be repeatable as long as read_seqretry() returns a non-zero value. In other words, we were essentially ignoring the sequence lock; 2) Right below the read sequence lock loop, we were reading the values from avail_metadata_alloc_bits and avail_data_alloc_bits without any protection from concurrent writers, that is, reading them outside of the read sequence lock critical section. So fix this by making sure we only read the available allocation bits while in a read sequence lock critical section and that what we do in the critical section is repeatable (has nothing that can not be undone) so that any eventual retry that is needed is handled properly. Fixes: de98ced9e743 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits") Fixes: 14506127979a ("btrfs: fix a bogus warning when converting only data or metadata") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc5de4e7 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Handle final split-brain possibility during fsid change This patch lands the last case which needs to be handled by the fsid change code. Namely, this is the case where a multidisk filesystem has already undergone at least one successful fsid change i.e all disks have the METADATA_UUID incompat bit and power failure occurs as another fsid change is in progress. When such an event occurs, disks could be split in 2 groups. One of the groups will have both METADATA_UUID and CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs. The other group of disks will have only METADATA_UUID bit set and their fsid will be different than the one in disks in the first group. Here we look at the following cases: a) A disk from the first group is scanned first, so fs_devices is created with stale fsid/metdata_uuid. Then when a disk from the second group is scanned it needs to first check whether there exists such an fs_devices that has fsid_change set to true (because it was created with a disk having the CHANGING_FSID_V2 flag), the metadata_uuid and fsid of the fs_devices will be different (since it was created by a disk which already has had at least 1 successful fsid change) and finally the metadata_uuid of the fs_devices will equal that of the currently scanned disk (because metadata_uuid never really changes). When the correct fs_devices is found the information from the scanned disk will replace the current one in fs_devices since the scanned disk will have higher generation number. b) A disk from the second group is scanned so fs_devices is created as usual with differing fsid/metdata_uid. Then when a disk from the first group is scanned the code detects that it has both CHANGING_FSID_V2 and METADATA_UUID flags set and will search for fs_devices that has differing metadata_uuid/fsid and whose metadata_uuid is the same as that of the scanned device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a62d0f0 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Handle one more split-brain scenario during fsid change This commit continues hardening the scanning code to handle cases where power loss could have caused disks in a multi-disk filesystem to be in inconsistent state. Namely handle the situation that can occur when some of the disks in multi-disk fs have completed their fsid change i.e they have METADATA_UUID incompat flag set, have cleared the CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At the same time the other half of the disks will have their fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag. This is handled by introducing code in the scan path which: a) Handles the case when a device with CHANGING_FSID_V2 flag is scanned and as a result btrfs_fs_devices is created with matching fsid/metdata_uuid. Subsequently, when a device with completed fsid change is scanned it will detect this via the new code in find_fsid i.e that such an fs_devices exist that fsid_change flag is set to true, it's metadata_uuid/fsid match and the metadata_uuid of the scanned device matches that of the fs_devices. In this case, it's important to note that the devices which has its fsid change completed will have a higher generation number than the device with FSID_CHANGING_V2 flag set, so its superblock block will be used during mount. To prevent an assertion triggering because the sb used for mounting will have differing fsid/metadata_uuid than the ones in the fs_devices struct also add code in device_list_add which overwrites the values in fs_devices. b) Alternatively we can end up with a device that completed its fsid change be scanned first which will create the respective btrfs_fs_devices struct with differing fsid/metadata_uuid. In this case when a device with FSID_CHANGING_V2 flag set is scanned it will call the newly added find_fsid_inprogress function which will return the correct fs_devices. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d1a63002 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: add members to fs_devices to track fsid changes In order to gracefully handle split-brain scenario during fsid change (which are very unlikely, yet possible), two more pieces of information will be necessary: 1. The highest generation number among all devices registered to a particular btrfs_fs_devices 2. A boolean flag whether a given btrfs_fs_devices was created by a device which had the FSID_CHANGING_V2 flag set. This is a preparatory patch and just introduces the variables as well as code which sets them, their actual use is going to happen in a later patch. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de37aa51 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fsid/metadata_fsid fields from btrfs_info Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7239ff4b |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Introduce support for FSID change without metadata rewrite This field is going to be used when the user wants to change the UUID of the filesystem without having to rewrite all metadata blocks. This field adds another level of indirection such that when the FSID is changed what really happens is the current UUID (the one with which the fs was created) is copied to the 'metadata_uuid' field in the superblock as well as a new incompat flag is set METADATA_UUID. When the kernel detects this flag is set it knows that the superblock in fact has 2 UUIDs: 1. Is the UUID which is user-visible, currently known as FSID. 2. Metadata UUID - this is the UUID which is stamped into all on-disk datastructures belonging to this file system. When the new incompat flag is present device scanning checks whether both fsid/metadata_uuid of the scanned device match any of the registered filesystems. When the flag is not set then both UUIDs are equal and only the FSID is retained on disk, metadata_uuid is set only in-memory during mount. Additionally a new metadata_uuid field is also added to the fs_info struct. It's initialised either with the FSID in case METADATA_UUID incompat flag is not set or with the metdata_uuid of the superblock otherwise. This commit introduces the new fields as well as the new incompat flag and switches all users of the fsid to the new logic. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates in comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
64bc6c2a |
|
26-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove superfluous check form btrfs_remove_chunk It's unnecessary to check map->stripes[i].dev for NULL given its value is already set and dereferenced above the the check. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a9261d41 |
|
14-Oct-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: harden agaist duplicate fsid on scanned devices It's not that impossible to imagine that a device OR a btrfs image is copied just by using the dd or the cp command. Which in case both the copies of the btrfs will have the same fsid. If on the system with automount enabled, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loop hole the new copied device would appears as if its mounted immediately after it's been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part / |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda $ dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 1 14.9G 0 disk |-sda4 8:4 1 4G 0 part / |-sda2 8:2 1 500M 0 part |-sda3 8:3 1 256M 0 part `-sda1 8:1 1 256M 0 part mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty that you can't either unmount /dev/sda4 or /dev/mmcblk0p4. And the problem does not get solved until you take sda out of the system on to another system to change its fsid using the 'btrfstune -u' command. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b50836ed |
|
04-Oct-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: introduce nparity raid_attr Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an nparity field in raid_attr. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da612e31 |
|
04-Oct-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: fix ncopies raid_attr for RAID56 RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These values are not yet used anywhere so there's no change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
baf92114 |
|
04-Oct-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: alloc_chunk: fix more DUP stripe size handling Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling" fixed calculating the stripe_size for a new DUP chunk. However, the same calculation reappears a bit later, and that one was not changed yet. The resulting bug that is exposed is that the newly allocated device extents ('stripes') can have a few MiB overlap with the next thing stored after them, which is another device extent or the end of the disk. The scenario in which this can happen is: * The block device for the filesystem is less than 10GiB in size. * The amount of contiguous free unallocated disk space chosen to use for chunk allocation is 20% of the total device size, or a few MiB more or less. An example: - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB) - There's 1578MiB unallocated raw disk space left in one contiguous piece. In this case stripe_size is first calculated as 789MiB, (half of 1578MiB). Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we enter the if block. Now stripe_size value is immediately overwritten while calculating an adjusted value based on max_chunk_size, which ends up as 788MiB. Next, the value is rounded up to a 16MiB boundary, 800MiB, which is actually more than the value we had before. However, the last comparison fails to detect this, because it's comparing the value with the total amount of free space, which is about twice the size of stripe_size. In the example above, this means that the resulting raw disk space being allocated is 1600MiB, while only a gap of 1578MiB has been found. The second device extent object for this DUP chunk will overlap for 22MiB with whatever comes next. The underlying problem here is that the stripe_size is reused all the time for different things. So, when entering the code in the if block, stripe_size is immediately overwritten with something else. If later we decide we want to have the previous value back, then the logic to compute it was copy pasted in again. With this change, the value in stripe_size is not unnecessarily destroyed, so the duplicated calculation is not needed any more. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23f0ff1e |
|
04-Oct-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: alloc_chunk: improve chunk size variable name The variable num_bytes is really a way too generic name for a variable in this function. There are a dozen other variables that hold a number of bytes as value. Give it a name that actually describes what it does, which is holding the size of the chunk that we're allocating. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f29df4f |
|
04-Oct-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: alloc_chunk: do not refurbish num_bytes The variable num_bytes is used to store the chunk length of the chunk that we're allocating. Do not reuse it for something really different in the same function. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc8a168a |
|
08-Nov-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Check for missing device before bio submission in btrfs_map_bio Before btrfs_map_bio submits all stripe bios it does a number of checks to ensure the device for every stripe is present. However, it doesn't do a DEV_STATE_MISSING check, instead this is relegated to the lower level btrfs_schedule_bio (in the async submission case, sync submission doesn't check DEV_STATE_MISSING at all). Additionally btrfs_schedule_bios does the duplicate device->bdev check which has already been performed in btrfs_map_bio. This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and removes the duplicate device->bdev check. Doing so ensures that no bio cloning/submission happens for both async/sync requests in the face of missing device. This makes the async io submission path slightly shorter in terms of instruction count. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
60ca842e |
|
16-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: rename and export get_chunk_map The Btrfs swap code is going to need it, so give it a btrfs_ prefix and make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eede2bf3 |
|
03-Nov-2016 |
Omar Sandoval <osandov@fb.com> |
Btrfs: prevent ioctls from interfering with a swap file A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05a37c48 |
|
05-Oct-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Make sure no dev extent is beyond device boundary Add extra dev extent end check against device boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5eb19381 |
|
05-Oct-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Make sure there is no overlap of dev extents at mount time Enhance btrfs_verify_dev_extents() to remember previous checked dev extents, so it can verify no dev extents can overlap. Analysis from Hans: "Imagine allocating a DATA|DUP chunk. In the chunk allocator, we first set... max_stripe_size = SZ_1G; max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE ... which is 10GiB. Then... /* we don't want a chunk larger than 10% of writeable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); Imagine we only have one 7880MiB block device in this filesystem. Now max_chunk_size is down to 788MiB. The next step in the code is to search for max_stripe_size * dev_stripes amount of free space on the device, which is in our example 1GiB * 2 = 2GiB. Imagine the device has exactly 1578MiB free in one contiguous piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail Next we recalculate the stripe_size (which is actually the device extent length), based on the actual maximum amount of available raw disk space: stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes); stripe_size is now 789MiB Next we do... data_stripes = num_stripes / ncopies ...where data_stripes ends up as 1, because num_stripes is 2 (the amount of device extents we're going to have), and DUP has ncopies 2. Next there's a check... if (stripe_size * data_stripes > max_chunk_size) ...which matches because 789MiB * 1 > 788MiB. We go into the if code, and next is... stripe_size = div_u64(max_chunk_size, data_stripes); ...which resets stripe_size to max_chunk_size: 788MiB Next is a fun one... /* bump the answer up to a 16MB boundary */ stripe_size = round_up(stripe_size, SZ_16M); ...which changes stripe_size from 788MiB to 800MiB. We're not done changing stripe_size yet... /* But don't go higher than the limits we found while searching * for free extents */ stripe_size = min(devices_info[ndevs - 1].max_avail, stripe_size); This is bad. max_avail is twice the stripe_size (we need to fit 2 device extents on the same device for DUP). The result here is that 800MiB < 1578MiB, so it's unchanged. However, the resulting DUP chunk will need 1600MiB disk space, which isn't there, and the second dev_extent might extend into the next thing (next dev_extent? end of device?) for 22MiB. The last shown line of code relies on a situation where there's twice the value of stripe_size present as value for the variable stripe_size when it's DUP. This was actually the case before commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote: "[...] in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size." In the previous version of the code, the 16MiB alignment (why is this done, by the way?) would result in a 50% chance that it would actually do an 8MiB alignment for the individual dev_extents, since it was operating on double the size. Does this matter? Does it matter that stripe_size can be set to anything which is not 16MiB aligned because of the amount of remaining available disk space which is just taken? What is the main purpose of this round_up? The most straightforward thing to do seems something like... stripe_size = min( div_u64(devices_info[ndevs - 1].max_avail, dev_stripes), stripe_size ) ..just putting half of the max_avail into stripe_size." Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/ Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ add analysis from report ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7fb2eced |
|
24-Aug-2018 |
David Sterba <dsterba@suse.com> |
btrfs: open code btrfs_dev_replace_clear_lock_blocking There's a single caller and the function name does not say it's actually taking the lock, so open coding makes it more explicit. For now, btrfs_dev_replace_read_lock is used instead of read_lock so it's paired with the unlocking wrapper in the same block. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
818255fe |
|
21-Sep-2018 |
David Sterba <dsterba@suse.com> |
btrfs: use common helper instead of open coding a bit test The helper does the same math and we take care about the special case when flags is 0 too. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07e1ce09 |
|
22-Aug-2018 |
Liu Bo <bo.liu@linux.alibaba.com> |
Btrfs: extent_map: use rb_first_cached rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). As evict_inode_truncate_pages() removes all extent mapping by always looking for the first rb entry, it's helpful to use rb_first_cached instead. For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a27a94c2 |
|
02-Sep-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly Instead of returning an error value and using one of the parameters for returning the actual object we are interested in just refactor the function to directly return btrfs_device *. Also bubble up the error handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into btrfs_rm_device. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c050407 |
|
02-Sep-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_find_device_missing_or_by_path return directly a device This function returns a numeric error value and additionally the device found in one of its input parameters. Simplify this by making the function directly return a pointer to btrfs_device. Additionally adjust the caller to handle the case when we want to remove the 'missing' device and ENOENT is returned to return the expected positive error value, parsed by progs. Finally, unexport the function since it's not called outside of volume.c. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b444ad46 |
|
02-Sep-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_find_device_by_path return struct btrfs_device Currently this function returns an error code as well as uses one of its arguments as a return value for struct btrfs_device. Change the function so that it returns btrfs_device directly and use the usual "encode error in pointer" mechanics if something goes wrong. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1da73967 |
|
09-Aug-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add helper to obtain number of devices with ongoing dev-replace When the replace is running the fs_devices::num_devices also includes the replaced device, however in some operations like device delete and balance it needs the actual num_devices without the repalced devices. The function btrfs_num_devices() just provides that. And here is a scenario how balance and repalce items could co-exist: Consider balance is started and paused, now start the replace followed by a unmount or system power-cycle. During following mount, the open_ctree() first restarts the balance so it must check for the device replace otherwise our num_devices calculation will be wrong. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16220c46 |
|
09-Aug-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add assertions where number of devices could go below 0 In preparation to add helper function to deduce the num_devices with replace running, use assert instead of BUG_ON or WARN_ON. The number of devices would not normally drop to 0 due to other checks so the assert is sufficient. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, adjust the assert condition ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
801660b0 |
|
06-Aug-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: btrfs_shrink_device should call commit transaction at the end Test case btrfs/164 reports use-after-free: [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP .. [ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs] [ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs] [ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs] [ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs] Reason for this is that btrfs_shrink_device adds the resized device to the fs_devices::resized_devices after it has called the last commit transaction. So the list fs_devices::resized_devices is not empty when btrfs_shrink_device returns. Now the parent function btrfs_rm_device calls: btrfs_close_bdev(device); call_rcu(&device->rcu, free_device_rcu); and then does the transactio ncommit. It goes through the fs_devices::resized_devices in btrfs_update_commit_device_size and leads to use-after-free. Fix this by making sure btrfs_shrink_device calls the last needed btrfs_commit_transaction before the return. This is consistent with what the grow counterpart does and this makes sure the on-disk state is persistent when the function returns. Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39379faa |
|
26-Jul-2018 |
Naohiro Aota <naota@elisp.net> |
btrfs: revert fs_devices state on error of btrfs_init_new_device When btrfs hits error after modifying fs_devices in btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it leaves everything as is, but frees allocated btrfs_device. As a result, fs_devices->devices and fs_devices->alloc_list contain already freed btrfs_device, leading to later use-after-free bug. Error path also messes the things like ->num_devices. While they go back to the original value by unscanning btrfs devices, it is safe to revert them here. Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling") Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
64f64f43 |
|
31-Jul-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Exit gracefully when chunk map cannot be inserted to the tree It's entirely possible that a crafted btrfs image contains overlapping chunks. Although we can't detect such problem by tree-checker, it's not a catastrophic problem, current extent map can already detect such problem and return -EEXIST. We just only need to exit gracefully and fail the mount. Reported-by: Xu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf90d884 |
|
31-Jul-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Introduce mount time chunk <-> dev extent mapping check This patch will introduce chunk <-> dev extent mapping check, to protect us against invalid dev extents or chunks. Since chunk mapping is the fundamental infrastructure of btrfs, extra check at mount time could prevent a lot of unexpected behavior (BUG_ON). Reported-by: Xu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403 Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
672d5990 |
|
02-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Use wrapper macro for rcu string to remove duplicate code Cleanup patch and no functional changes. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97aff912 |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_finish_chunk_alloc It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f4208794 |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info form btrfs_free_chunk It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4f5ad7bd |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdev This function is always passed a well-formed tgtdevice so the fs_info can be referenced from there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d6507cf1 |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_assign_next_active_device It can be referenced from the passed 'device' argument which is always a well-formed device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5495f195 |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove fs_info argument from update_dev_stat_item It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
68a9db5f |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdev It can be referenced from the passed srcdev argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e87e856 |
|
20-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info argument from btrfs_add_dev_item It can be referenced form the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
315409b0 |
|
04-Jul-2018 |
Gu Jinxiang <gujx@cn.fujitsu.com> |
btrfs: validate type when reading a chunk Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an image that has an invalid chunk type but does not return an error. Add chunk type check in btrfs_check_chunk_valid, to detect the wrong type combinations. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839 Reported-by: Xu Wen <wen.xu@gatech.edu> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
46df06b8 |
|
13-Jul-2018 |
David Sterba <dsterba@suse.com> |
btrfs: refactor block group replication factor calculation to a helper There are many places that open code the duplicity factor of the block group profiles, create a common helper. This can be easily extended for more copies. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
321a4bf7 |
|
16-Jul-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use the assigned fs_devices instead of the dereference We have assigned the %fs_info->fs_devices in %fs_devices as its not modified just use it for the mutex_lock(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
36350e95 |
|
12-Jul-2018 |
Gu Jinxiang <gujx@cn.fujitsu.com> |
btrfs: return device pointer from btrfs_scan_one_device Return device pointer (with the IS_ERR semantics) from btrfs_scan_one_device so we don't have to return in through pointer. And since btrfs_fs_devices can be obtained from btrfs_device, return that. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ fixed conflics after recent changes to btrfs_scan_one_device ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5194e34 |
|
19-Jun-2018 |
David Sterba <dsterba@suse.com> |
btrfs: lift uuid_mutex to callers of btrfs_open_devices Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
899f9307 |
|
19-Jun-2018 |
David Sterba <dsterba@suse.com> |
btrfs: lift uuid_mutex to callers of btrfs_scan_one_device Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7bcb8164 |
|
29-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use device_list_mutex when removing stale devices btrfs_free_stale_devices() finds a stale (not opened) device matching path in the fs_uuid list. We are already under uuid_mutex so when we check for each fs_devices, hold the device_list_mutex too. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fa6d2ae5 |
|
29-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename local devices for fs_devices in btrfs_free_stale_devices( Over the years we named %fs_devices and %devices to represent the struct btrfs_fs_devices and the struct btrfs_device. So follow the same scheme here too. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c6d173e |
|
29-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: extend locked section when adding a new device in device_list_add Make sure the device_list_lock is held the whole time: * when the device is being looked up * new device is initialized and put to the list * the list counters are updated (fs_devices::opened, fs_devices::total_devices) Signed-off-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4306a974 |
|
28-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: do btrfs_free_stale_devices outside of device_list_add btrfs_free_stale_devices() looks for device path reused for another filesystem, and deletes the older fs_devices::device entry. In preparation to handle locking in device_list_add, move btrfs_free_stale_devices outside as these two functions serve a different purpose. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
959b1c04 |
|
28-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: close devices without offloading to a temporary list Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for device list traversal") btrfs_show_devname no longer takes device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix a possible umount deadlock") aimed to fix no longer exists, we can free the devices immediatelly and remove the code that does the pending work. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
621567a2 |
|
09-Jul-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Remove unused function btrfs_account_dev_extents_size This function is not used since the alloc_start parameter has been obsoleted in commit 0d0c71b317207082856 ("btrfs: obsolete and remove mount option alloc_start"). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4993e64 |
|
03-Jul-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix in-memory value of total_devices after seed device deletion In case of deleting the seed device the %cur_devices (seed) and the %fs_devices (parent) are different. Now, as the parent fs_devices::total_devices also maintains the total number of devices including the seed device, so decrement its in-memory value for the successful seed delete. We are already updating its corresponding on-disk btrfs_super_block::number_devices value. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7f663fa |
|
29-Jun-2018 |
David Sterba <dsterba@suse.com> |
btrfs: prune unused includes Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
694c51fb |
|
02-Jul-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop unnecessary variable in btrfs_init_new_device There is only usage of the declared devices variable, instead use its value directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5da54bc1 |
|
02-Jul-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use a temporary variable for fs_devices in btrfs_init_new_device There are many instances of the %fs_info->fs_devices pointer dereferences, use a temporary variable instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fce466ea |
|
03-Jul-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: tree-checker: Verify block_group_item A crafted image with invalid block group items could make free space cache code to cause panic. We could detect such invalid block group item by checking: 1) Item size Known fixed value. 2) Block group size (key.offset) We have an upper limit on block group item (10G) 3) Chunk objectid Known fixed value. 4) Type Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA. No more than 1 bit set for profile type. 5) Used space No more than the block group size. This should allow btrfs to detect and refuse to mount the crafted image. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43a7e99d |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_force_chunk_alloc It can be referenced from the passed transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
451a2c13 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from check_system_chunk It can be referenced from trans since the function is always called within a transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c216b203 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_alloc_chunk It can be referenced from trans since the function is always called within a transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a98ec01 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_remove_block_group This function is always called with a valid transaction handle from where we can reference fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e7e02096 |
|
20-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info from btrfs_make_block_group This function is always called with a valid transaction handle from where we can reference the fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
20c5bbc6 |
|
20-Jun-2018 |
David Sterba <dsterba@suse.com> |
btrfs: restore uuid_mutex in btrfs_open_devices Commit 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") switched to device_list_mutex as we need that for the device list traversal, but we also need uuid_mutex to protect access to fs_devices::opened to be consistent with other users of that. Fixes: 542c5908abfe84f7b4c1 ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cdb345a8 |
|
29-May-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info argument from btrfs_uuid_tree_add This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6dac13f8 |
|
15-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add prefix "balance:" for log messages Kernel logs are very important for the forensic investigations of the issues in general make it easy to use it. This patch adds 'balance:' prefix so that it can be easily searched. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d9a071f0 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev Use a local btrfs_fs_devices variable to access the structure. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab5c2f65 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev Delete the uuid_mutex lock here as this thread accesses the btrfs_fs_devices::devices only (counters or called functions do a list traversal). And the device_list_mutex lock is already taken. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
542c5908 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices is just limited to openning all the devices under for given fsid, so we don't need uuid_mutex. Instead it should hold the device_list_mutex as it updates the members of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3dd0f7a3 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: document uuid_mutex uasge in read_chunk_tree read_chunk_tree() calls read_one_dev(), but for seed device we have to search the fs_uuids list, so we need the uuid_mutex. Add a comment comment, so that we can improve this part. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41a52a0f |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use existing cur_devices, cleanup btrfs_rm_device Instead of de-referencing the device->fs_devices use cur_devices which points to the same fs_devices and does not change. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b6ed73bc |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: reduce uuid_mutex critical section while scanning devices The generic block device lookup or cleanup does not need the uuid mutex, that's only for the device_list_add. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fcf6e2b |
|
07-May-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove redundant btrfs_balance_control::fs_info The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
63a9c7b9 |
|
04-May-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove devid parameter from btrfs_rmap_block This function is used in only one place and devid argument is always passed 0. So just remove it, similarly to how it was removed in the userspace code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0338dff6 |
|
27-Apr-2018 |
Gu Jinxiang <gujx@cn.fujitsu.com> |
btrfs: do reverse path readahead in btrfs_shrink_device In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to READA_FORWARD. But I think READA_BACK is correct. Since: 1. key.offset is set to (u64)-1 2. after btrfs_search_slot, btrfs_previous_item is called So, for readahead previous items, READA_BACK is the correct one. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9fbcaa2 |
|
25-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::mindev_error so that btrfs_raid_array can maintain the error code to return if the minimum number of devices condition is not met while trying to delete a device in the given raid. And so we can drop btrfs_raid_mindev_error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41a6e891 |
|
25-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move btrfs_raid_group values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed23467b |
|
25-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::raid_name so that btrfs_raid_array can maintain the name of the raid type, and so we can drop btrfs_raid_type_names. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
833aae18 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: open code set_balance_control The helper is quite simple and I'd like to see the locking in the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1354e1a1 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: use mutex in btrfs_resume_balance_async While the spinlock does not cause problems, using the mutex is more correct and consistent with others. The global status of balance is eg. checked from btrfs_pause_balance or btrfs_cancel_balance with mutex. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists, in the latter, balance cannot run until the ro/rw transition is finished. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
008ef096 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf7d20f4 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move and comment read-only check in btrfs_cancel_balance Balance cannot be started on a read-only filesystem and will have to finish/exit before eg. going to read-only via remount. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway and if the cancel call is too fast it will just wait for that to happen. The last case is when the balance is paused after mount but it's read-only and cancelling would want to delete the item. The test is moved after the check if balance is running at all, as it looks more logical to report "no balance running" instead of "read-only filesystem". Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3009a62f |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dccdb07b |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0fecc23 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
149196a2 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup helpers that reset balance state The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eee95e3f |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add sanity check when resuming balance after mount Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a17c95df |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move clearing of EXCL_OP out of __cancel_balance Make the clearning visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72b81abf |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move volume_mutex to callers of btrfs_rm_device Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d48f39d5 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static The function logically belongs there and there's only a single caller, no need to export it. No code changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a425f9d4 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: export and rename free_device The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fc4749d |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b5185197 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup btrfs_rm_device() promote fs_devices pointer This function uses fs_info::fs_devices number of time, however we declare and use it only at the end, instead do it in the beginning of the function and use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
636d2c9d |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup find_device() drop list_head pointer find_device() declares struct list_head *head pointer and used only once, instead just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
897fb573 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename __btrfs_open_devices to open_fs_devices __btrfs_open_devices() is un-exported drop __ prefix and rename it to open_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0226e0eb |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename __btrfs_close_devices to close_fs_devices __btrfs_close_devices() is un-exported, drop the __ prefix and rename it to close_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f117e290 |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup __btrfs_open_devices() drop head pointer __btrfs_open_devices() declares struct list_head *head, however head is used only once, instead use btrfs_fs_devices::devices directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4babc5e |
|
11-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename struct btrfs_fs_devices::list btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic name 'list' makes it's search very difficult, rename it to fs_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
89595e80 |
|
18-Apr-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add comment about BTRFS_FS_EXCL_OP Adds comments about BTRFS_FS_EXCL_OP to existing comments about the device locks. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02ee654d |
|
17-May-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix crash when trying to resume balance without the resume flag We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1d7c514 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7e79cb86 |
|
23-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: split dev-replace locking helpers for read and write The current calls are unclear in what way btrfs_dev_replace_lock takes the locks, so drop the argument, split the helpers and use similar naming as for read and write locks. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a32bf9a3 |
|
15-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: use lockdep_assert_held for mutexes Using lockdep_assert_held is preferred, replace mutex_is_locked. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
75cb379d |
|
20-Mar-2018 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: defer adding raid type kobject until after chunk relocation Any time the first block group of a new type is created, we add a new kobject to sysfs to hold the attributes for that type. Kobject-internal allocations always use GFP_KERNEL, making them prone to fs-reclaim races. While it appears as if this can occur any time a block group is created, the only times the first block group of a new type can be created in memory is at mount and when we create the first new block group during raid conversion. This patch adds a new list to track pending kobject additions and then handles them after we do chunk relocation. Between relocating the target chunk (or forcing allocation of a new chunk in the case of data) and removing the old chunk, we're in a safe place for fs-reclaim to occur. We're holding the volume mutex, which is already held across page faults, and the delete_unused_bgs_mutex, which will only stall the cleaner thread. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ba0ae78 |
|
14-Mar-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: drop optimal argument from find_live_mirror() Drop optimal argument from the function find_live_mirror() as we can deduce it in the function itself. Also rename optimal to preferred_mirror. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
99f92a7c |
|
14-Mar-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: drop num argument from find_live_mirror() Obtain the stripes info from the map directly and so no need to pass it as an argument. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba89b802 |
|
30-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk When checking the minimal nr_devs, there is one dead and meaningless condition: if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) { ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This condition is meaningless, @devs_increment has nothing to do with @sub_stripes. In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1 (RAID10) already has the @devs_increment set to 2. So no need to multiple it by @sub_stripes. And above condition is also dead. For RAID10, @devs_increment * @sub_stripes equals 4, which is also the @devs_min of RAID10. For other profiles, @sub_stripes is always 1, and since @ndevs is rounded down to @devs_increment, the condition will always be true. Remove the meaningless condition to make later reader wander less. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c829b72 |
|
07-Mar-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add define for oldest generation Some functions can filter metadata by the generation. Add a define that will annotate such arguments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b99b115 |
|
26-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids This function btrfs_close_extra_devices() is about freeing extra devids which once it may have belonged to this filesystem. So rename it and add the comment. The _devid suffix is appropriate as this function won't handle devices which are outside of the filesytem being mounted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16db5758 |
|
20-Feb-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: remove assert in btrfs_init_dev_replace_tgtdev() In the same function we just ran btrfs_alloc_device() which means the btrfs_device::resized_list is sure to be empty and we are protected with the btrfs_fs_info::volume_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ffc5a379 |
|
19-Feb-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add (the only possible) __exit annotation Recently, the __init annotations have been added. There's unfortunatelly only one case where we can add __exit, because most of the cleanup helpers are also called from the __init phase. As the __exit annotated functions get discarded completely for a built-in code, we'd miss them from the init phase. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e72ee88 |
|
30-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index() Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. 3) Remove function get_block_group_index() Really no need to do such a simple thing as an exported function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
793ff2c8 |
|
30-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: volumes: Cleanup stripe size calculation Cleanup the following things: 1) open coded SZ_16M round up 2) use min() to replace open-coded size comparison 3) code style Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b1b8e386 |
|
22-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: insert newly opened device to the end of the list Add opened device to the tail of dev_alloc_list instead of head, so that it maintains the same order as dev_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f8e10cd3 |
|
22-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: keep device list sorted By maintaining the device list sorted lets us reproduce the problems related to missing chunk in the degraded mode much more consistent. So fix this by sorting the devices by devid within the kernel. So that we know which device is assigned to the struct fs_info::latest_bdev when all the devices are having and same SB generation. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4117f207 |
|
21-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option Enospc_debug makes extent allocator print more debug messages, however for chunk allocation, there is no debug message for enospc_debug at all. This patch will add message for the following parts of chunk allocator: 1) No rw device at all Quite rare, but at least output one message for this case. 2) Not enough space for some device This debug message is quite handy for unbalanced disks with stripe based profiles (RAID0/10/5/6). 3) Not enough free devices This debug message should tell us if current chunk allocator is working correctly under minimal device requirements. Although in most cases, we will hit other ENOSPC before we even hit a chunk allocator ENOSPC, but such debug info won't help. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9b919b1 |
|
07-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used We already pass the btrfs_transaction which references fs_info so no need to pass the later as an argument. Also use the opportunity to shorten transaction->trans. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
15fc1283 |
|
12-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume() btrfs_init_dev_replace_tgtdev_for_resume() initializes replace target device in a few simple steps, so do it at the parent function. Moreover, there isn't any other caller so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
062d4d1f |
|
30-Jan-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92e222df |
|
05-Feb-2018 |
Hans van Kranenburg <hans.van.kranenburg@mendix.com> |
btrfs: alloc_chunk: fix DUP stripe size handling In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd649f10 |
|
30-Jan-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3acbcbfc |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop devid as device_list_add() arg As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e124ece5 |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: get device pointer from device_list_add() Instead of pointer to btrfs_fs_devices as an arg in device_list_add() better to get pointer to btrfs_device as return value, then we have both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is needed to handle reappearing missing device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2788d2f |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: set the total_devices in device_list_add() There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
327f18cc |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move pr_info into device_list_add Commit 60999ca4b403 ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8367db3 |
|
18-Jan-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: make btrfs_free_stale_devices() to match the path The btrfs_free_stale_devices() is updated to match for the given device path and delete it. (It searches for only unmounted list of devices.) Also drop the comment about different path being used for the same device, since now we will have cli to clean any device that's not a concern any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d34097f |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_free_stale_devices() arg to skip_dev No functional changes. Rename btrfs_free_stale_devices() arg to skip_dev, so that it reflects what that arg for. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
522f1b45 |
|
18-Jan-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: make btrfs_free_stale_devices() argument optional This updates btrfs_free_stale_devices() helper function to delete all unmouted devices, when arg is NULL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38cf665d |
|
18-Jan-2018 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: make btrfs_free_stale_device() to iterate all stales Let the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Also rename btrfs_free_stale_device() to btrfs_free_stale_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a848b3e5 |
|
18-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: no need to check for btrfs_fs_devices::seeding There is no need to check for btrfs_fs_devices::seeding when we have checked for btrfs_fs_devices::opened, because we can't sprout without its seed FS being opened. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cbf26da |
|
17-Jan-2018 |
Matthew Wilcox <willy@infradead.org> |
btrfs: Remove unused readahead spinlock The reada_lock in struct btrfs_device was only initialised, and not actually used. That's good because there's another lock also called reada_lock in the btrfs_fs_info that was quite heavily used. Remove this one. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c94da9d |
|
09-Jan-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup btrfs_free_stale_device() usage We call btrfs_free_stale_device() only when we alloc a new struct btrfs_device (ret=1), so move it closer to where we alloc the new device. Also drop the comments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a6f93c71 |
|
15-Nov-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: avoid losing data raid profile when deleting a device We've avoided data losing raid profile when doing balance, but it turns out that deleting a device could also result in the same problem. Say we have 3 disks, and they're created with '-d raid1' profile. - We have chunk P (the only data chunk on the empty btrfs). - Suppose that chunk P's two raid1 copies reside in disk A and disk B. - Now, 'btrfs device remove disk B' btrfs_rm_device() -> btrfs_shrink_device() -> btrfs_relocate_chunk() #relocate any chunk on disk B to other places. - Chunk P will be removed and a new chunk will be created to hold those data, but as chunk P is the only one holding raid1 profile, after it goes away, the new chunk will be created as single profile which is our default profile. This fixes the problem by creating an empty data chunk before relocating the data chunk. Metadata/System chunk are supposed to have non-zero bytes all the time so their raid profile is preserved. Reported-by: James Alandt <James.Alandt@wdc.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05a5c55d |
|
15-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: minor style cleanups in btrfs_scan_one_device Assign ret = -EINVAL where it is actually required. Remove { } around single line if else code. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8810f751 |
|
02-Jan-2018 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: make raid6 rebuild retry more There is a scenario that can end up with rebuild process failing to return good content, i.e. suppose that all disks can be read without problems and if the content that was read out doesn't match its checksum, currently for raid6 btrfs at most retries twice, - the 1st retry is to rebuild with all other stripes, it'll eventually be a raid5 xor rebuild, - if the 1st fails, the 2nd retry will deliberately fail parity p so that it will do raid6 style rebuild, however, the chances are that another non-parity stripe content also has something corrupted, so that the above retries are not able to return correct content, and users will think of this as data loss. More seriouly, if the loss happens on some important internal btree roots, it could refuse to mount. This extends btrfs to do more retries and each retry fails only one stripe. Since raid6 can tolerate 2 disk failures, if there is one more failure besides the failure on which we're recovering, this can always work. The worst case is to retry as many times as the number of raid6 disks, but given the fact that such a scenario is really rare in practice, it's still acceptable. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6528b99d |
|
18-Dec-2017 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: factor btrfs_check_rw_degradable() to check given device Update btrfs_check_rw_degradable() to check against the given device if its lost. We can use this function to know if the volume is going to be in degraded mode OR failed state, when the given device fails. Which is needed when we are handling the device failed state. A preparatory patch does not affect the flow as such. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ enhance comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3e798068 |
|
11-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove pair of bio_get/put in btrfs_schedule_bio This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios during submit_bio, add some extra bio checks"). However, holding a reference on a bio is necessary only if it's going to be referenced after the submit_bio returns and the bio is completed. In this particular instance this is not the case so there is no need to hold an extra reference since we directly return. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
401e29c1 |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e6e674bd |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING Currently device state is being managed by each individual int variable such as struct btrfs_device::missing. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by : Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e12c9621 |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA Currently device state is being managed by each individual int variable such as struct btrfs_device::in_fs_metadata. Instead of that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebbede42 |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38b5f68e |
|
29-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop btrfs_device::can_discard to query directly We can query the bdev directly when needed at btrfs_discard_extent() so drop btrfs_device::can_discard. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0fb08bcc |
|
09-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: factor __btrfs_open_devices() to create btrfs_open_one_device() No functional changes, create btrfs_open_one_device() from __btrfs_open_devices(). This is a preparatory work to add dynamic device scan. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ minor whitespace fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f050db4 |
|
09-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move check for device generation to the last No functional changes. This helps to move the entire section into a new function. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
71f8a8d2 |
|
09-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: set fs_devices->seed directly This is in preparation to move a section of code in __btrfs_open_devices() into a new function so that it can be reused. As we set seeding if any of the device is having SB flag BTRFS_SUPER_FLAG_SEEDING, so do it in the device list loop itself. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
08ffcae8 |
|
19-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: simplify btrfs_close_bdev Split the conditions a bit. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c6b1c4d |
|
16-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: document device locking Overview of the main locks protecting various device-related structures. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c4cf6c9 |
|
30-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: simplify exit paths in btrfs_init_new_device Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
55de4803 |
|
30-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use free_device where opencoded Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
48dae9cf |
|
30-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: introduce free_device helper A helper to free a device and all it's dynamically allocated members, like the rcu_string name or flush_bio. This is going to replace all open coded places. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f06c5965 |
|
06-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: rename device free rcu helper to free_device_rcu Make it clear that it is an RCU helper, we want to use the name free_device for a wrapper freeing all device members. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c74a0b02 |
|
06-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_add_device to btrfs_add_dev_item Function btrfs_add_device() is adding the device item so rename to reflect that in the function. Similarly we have btrfs_rm_dev_item(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2c997384 |
|
05-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move volume_mutex into the btrfs_rm_device() A cleanup patch no functional change, we hold volume_mutex before calling btrfs_rm_device, so move it into the function itself. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47dba171 |
|
10-Oct-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove rcu_barrier in btrfs_close_devices It was introduced because btrfs used to do blkdev_put in a deferred work, now that btrfs has blkdev_put in place, this rcu_barrier can be removed. modprobe -r btrfs will do btrfs_cleanup_fs_uuids(), where it cleanup every %fs_devices on the list, but when we do btrfs_close_devices(), we have replaced the devices on the list with dummy ones which only have the same name and uuid, so modprobe -r btrfs will free those instead of what we were using, this change won't cause a problem for it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copied 2nd paragraph from mailinglist discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9deae968 |
|
24-Oct-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix memory barriers usage with device stats counters Commit addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared") reworked the way device stats changes are tracked. A new atomic dev_stats_ccnt counter was introduced which is incremented every time any of the device stats counters are changed. This serves as a flag whether there are any pending stats changes. However, this patch only partially implemented the correct memory barriers necessary: - It only ordered the stores to the counters but not the reads e.g. btrfs_run_dev_stats - It completely omitted any comments documenting the intended design and how the memory barriers pair with each-other This patch provides the necessary comments as well as adds a missing smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_cnt is only a snapshot at best there was no point in reading the counter twice - once in btrfs_dev_stats_dirty and then again when assigning stats_cnt. Just collapse both reads into 1. Fixes: addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1cb34c8e |
|
20-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: clean up btrfs_dev_stat_inc usage btrfs_end_bio() is using btrfs_dev_stat_inc() and then btrfs_dev_stat_print_on_error() separately instead use btrfs_dev_stat_inc_and_print() directly. As of now there isn't any bio in btrfs which is - a non-empty write and also the REQ_PREFLUSH flag is set. So in actual the condition if (bio->bi_opf & REQ_PREFLUSH) is never true in btrfs_end_bio(), and so there won't be any redundant error log by using btrfs_dev_stat_inc_and_print() separately one for write and another for flush. This consolidation will help to add the device critical error handles in the function btrfs_dev_stat_inc_and_print() and which can be renamed as needed. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f5316c1 |
|
23-Oct-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: free btrfs_device in place It's pointless to defer it to a kthread helper as we're not under a special context. For reference, commit 1f78160ce1b1 ("Btrfs: using rcu lock in the reader side of devices list") introduced RCU freeing for device structures. Originally the blkdev_put was called from free_device and rcu_barrier had to be called. This is no longer required, bdev and our device structures are now freed separately. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
beed9263 |
|
13-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix flush bio leak Commit e0ae99941423 ("btrfs: preallocate device flush bio") reworked the way the flush bio is allocated and used. Concretely it allocates the bio in __alloc_device and then re-uses it multiple times with a very simple endio routine that just calls complete() without consuming a reference. Allocated bios by default come with a ref count of 1, which is then consumed by the endio routine (or not, in which case they should be bio_put by the caller). The way the impleementation works now is that the flush bio has a refcount of 2 and we only ever bio_put it once, leaving it to hang indefinitely. Fix this by removing the extra bio_get in __alloc_device. Fixes: e0ae99941423 ("btrfs: preallocate device flush bio") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1751e8a6 |
|
27-Nov-2017 |
Linus Torvalds <torvalds@linux-foundation.org> |
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
619c47f3 |
|
19-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: dev_alloc_list is not protected by RCU, use normal list_del The dev_alloc_list list could be protected by various mutexes, depending on the context. The list tracks devices that can take part of allocating new chunks, so the closest mutex is chunk_mutex. Adding a new device from inside the ADD_DEV ioctl will need device_list_mutex and registering a new device from the ioctl needs uuid_mutex. All mutexes naturally guarantee exclusivity against the same context. The device ownership can move between the contexts and the exclusivity is guaranteed by other means, eg. during the mount with the uuid_mutex. There's no RCU involved for dev_alloc_list. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3065ae5b |
|
30-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: add missing device::flush_bio puts This fixes potential bio leaks, in several error paths. Unfortunatelly the device structure freeing is opencoded in many places and I missed them when introducing the flush_bio. Most of the time, devices get freed through call_rcu(..., free_device), so it at least it's not that easy to hit the leak, but it's still possible through the path that frees stale devices. Fixes: e0ae99941423 ("btrfs: preallocate device flush bio") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5e9f2ad5 |
|
23-Oct-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix transaction abort during failure in btrfs_rm_dev_item btrfs_rm_dev_item calls several function under an active transaction, however it fails to abort it if an error happens. Fix this by adding explicit btrfs_abort_transaction/btrfs_end_transaction calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6dd38f81 |
|
16-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove BUG_ON in btrfs_rm_dev_replace_free_srcdev() That was only an extra check to tackle a few bugs around this area, now its safe to remove it. Replace it by an ASSERT. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
102ed2c5 |
|
13-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix false EIO for missing device When one of the device is missing, bbio_error() takes care of setting the error status. And if its only IO that is pending in that stripe, it fails to check the status of the other IO at %bbio_error before setting the error %bi_status for the %orig_bio. Fix this by checking if %bbio->error has exceeded the %bbio->max_errors. Reproducer as below fdatasync error is seen intermittently. mount -o degraded /dev/sdc /btrfs dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error The reason for the intermittences of the problem is because the following conditions have to be met, which depends on timing: In btrfs_map_bio() - the RAID1 the missing device has to be at %dev_nr = 1 In bbio_error() . before bbio_error() is called the bio of the not-missing device at %dev_nr = 0 must be completed so that the below condition is true if (atomic_dec_and_test(&bbio->stripes_pending)) { Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de483734 |
|
12-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use need_full_stripe() in __btrfs_map_block() A cleanup patch, use need_full_stripe() to replace the open code. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2dbe0c77 |
|
13-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use BLK_STS defines where needed At few places we could use BLK_STS_OK and BLK_STS_NOSUPP. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Satoru Taekeuchi <satoru.takeuchi@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ dropped first hunk btrfs_endio_direct_read ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b902dfc |
|
08-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix use of error or warning for missing device When device is missing without the -o degraded option then its an error so report it as an error instead of a warning. And when -o degraded option is provided, log the missing device as warning. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ switch error to bool ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a2b8e60 |
|
08-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: declare btrfs_report_missing_device() static Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45dbdbc9 |
|
08-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix EIO misuse to report missing degraded option EIO is only for the IO failure to the device, avoid it. Use ENOENT as that's the closest error code describing what happened. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
adfb69af |
|
10-Oct-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add_missing_dev() should return the actual error add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can be used to check for the actual error that occurred in the function. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> [ minor error message adjustment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f851689b |
|
07-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove nr_async_bios This was intended to congest higher layers to not send bios, but as 1) the congested bit has been taken by writeback Async bios come from buffered writes and DIO writes. For DIO writes, we want to submit them ASAP, while for buffered writes, writeback uses balance_dirty_pages() to throttle how much dirty pages we can have. 2) and no one is waiting for %nr_async_bios down to zero, Historically, it was introduced along with changes which let checksumming workload spread accross different cpus. And at that time, pdflush was used instead of per-bdi flushing, perhaps pdflush did not have the necessary information for writeback to do throttling. We can safely remove them now. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ additional explanation from mails, removed unused variable 'limit' ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7132a262 |
|
28-Sep-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: error out if btrfs_attach_transaction() fails btrfs_init_new_device() calls btrfs_attach_transaction() to commit sys chunks, and it should error out if it fails. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d31c32f6 |
|
28-Sep-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix BUG_ON in btrfs_init_new_device() Instead of BUG_ON return error to the caller. And handle the fail condition by calling the abort transaction and going through the error path. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0af2c4bf |
|
28-Sep-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: undo writable superblocke when sprouting fails When new device is being added to seed FS, seed FS is marked writable, but when we fail to bring in the new device, we missed to undo the writable part. This patch fixes it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1efb72a3 |
|
20-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Rework error handling of add_extent_mapping in __btrfs_alloc_chunk Currently the code executes add_extent_mapping and if it is successful it links the new mapping, it then proceeds to unlock the extent mapping tree and check for failure and handle them. Instead, rework the code to only perform a single check if add_extent_mapping has failed and handle it, otherwise the code continues in a linear fashion. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c9162bdf |
|
23-Aug-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: make some volumes.c functions static These aren't used outside of volumes.c. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
219d33b2 |
|
01-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove batch plug in run_scheduled_IO Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so we don't gain benefits by batching 64 bios here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd7d63c2 |
|
19-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block This seems to be a leftover of commit cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block"). It should use btrfs_op() helper to provide one of 'enum btrfs_map_op' types. Fixes: cf8cddd38bab ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
58efbc9f |
|
23-Aug-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix blk_status_t/errno confusion This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
74d46992 |
|
23-Aug-2017 |
Christoph Hellwig <hch@lst.de> |
block: replace bi_bdev with a gendisk pointer and partitions index This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
b5d9071c |
|
18-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent Currently this function is always called with the object id of the root key of the chunk_tree, which is always BTRFS_CHUNK_TREE_OBJECTID. So let's subsume it straight into the function itself. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ca00afb |
|
18-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent THe function is always called with chunk_objectid set to BTRFS_FIRST_CHUNK_TREE_OBJECTID. Let's collapse the parameter in the function itself. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
408fbf19 |
|
27-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove extraneous chunk_objectid variable BTRFS_FIRST_CHUNK_TREE_OBJECTIS id the only objectid being used in the chunk_tree. So remove a variable which is always set to that value and collapse its usage in callees which are passed this variable. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0174484d |
|
27-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs_make_block_group is always called with chunk_objectid set to BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will change anytime soon, so let's remove the argument and decrease the cognitive load when reading the code path. No functional change Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ce1dd2a |
|
28-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove redundant setting of uuid in btrfs_block_header btrfs_alloc_dev_extent currently unconditionally sets the uuid in the leaf block header the function is working with. This is unnecessary since this operation is peformed by the core btree handling code (splitting a node, allocating a new btree block etc). So let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
db7c942c |
|
16-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused sectorsize variable from struct map_lookup This variable was added in 1abe9b8a138c ("Btrfs: add initial tracepointi support for btrfs"), yet it never really got used, only assigned to. So let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44880fdc |
|
29-Jul-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use appropriate define for the fsid Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we should use the matching constant for the fsid buffer. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f6d2510 |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use named constant for bdev blocksize Superblock is read and written using buffer heads, we need to set the bdev blocksize. The magic constant has been hardcoded in several places, so replace it with a named constant. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
35c70103 |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: refactor find_device helper Polish the helper: * drop underscores, no special meaning here * pass fs_devices, as this is what the API implements * drop noinline, no apparent reason for such simple helper * constify uuid * add comment Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2dfeca9b |
|
13-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: merge alloc_device helpers There are two helpers called in chain from one location, we can merge the functionaliy. Originally, alloc_fs_devices could fill the device uuid randomly if we we didn't give the uuid buffer. This happens for seed devices but the fsid is generated in btrfs_prepare_sprout, so we can remove it. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4ff5fb5 |
|
19-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused parameters from volume.c functions This also adjusts the respective callers in other files. Those were found with -Wunused-parameter. btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3abeb ("Btrfs: RAID5 and RAID6") but it was never really used even in that commit btrfs_is_parity_mirror's mirror_num - same as above chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c4b ("Btrfs: devid subset filter") and never used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
110840bb |
|
19-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove unused variables clear_super - usage was removed in commit cea67ab92d3d ("btrfs: clean the old superblocks before freeing the device") but that change forgot to remove the actual variable. max_key - commit 6174d3cb43aa ("Btrfs: remove unused max_key arg from btrfs_search_forward") removed the max_key parameter but it forgot to remove references from callers. stripe_len - this one was added by e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") but even then it wasn't used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
500ceed8 |
|
14-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove find_raid56_stripe_len find_raid56_stripe_len statically returns SZ_64K which equals BTRFS_STRIPE_LEN. It's sole caller is __btrfs_alloc_chunk and it assigns the return value to ai variable which is already set to BTRFS_STRIPE_LEN. So remove the function invocation altogether and remove the function itself. Also remove the variable since it's only aliasing BTRFS_STRIPE_LEN and use the define directly. Use the occassion to simplify the rounding down of stripe_size now that the value we want it to align is a power of 2. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5502451 |
|
08-Mar-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Enhance message when a device is missing during mount For a missing device, btrfs will just refuse to mount with almost meaningless kernel message like: BTRFS info (device vdb6): disk space caching is enabled BTRFS info (device vdb6): has skinny extents BTRFS error (device vdb6): failed to read the system array: -5 BTRFS error (device vdb6): open_ctree failed This patch will print a new message about the missing device: BTRFS info (device vdb6): disk space caching is enabled BTRFS info (device vdb6): has skinny extents BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing BTRFS error (device vdb6): failed to read the system array: -5 BTRFS error (device vdb6): open_ctree failed Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc3cce23 |
|
08-Mar-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup num_tolerated_disk_barrier_failures As we use per-chunk degradable check, the global num_tolerated_disk_barrier_failures is of no use. We can now remove it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
21634a19 |
|
08-Mar-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount Introduce a new function, btrfs_check_rw_degradable(), to check if all chunks in btrfs is OK for degraded rw mount. It provides the new basis for accurate btrfs mount/remount and even runtime degraded mount check other than old one-size-fit-all method. Btrfs currently uses num_tolerated_disk_barrier_failures to do global check for tolerated missing device. Although the one-size-fit-all solution is quite safe, it's too strict if data and metadata has different duplication level. For example, if one use Single data and RAID1 metadata for 2 disks, it means any missing device will make the fs unable to be degraded mounted. But in fact, some times all single chunks may be in the existing device and in that case, we should allow it to be rw degraded mounted. Such case can be easily reproduced using the following script: # mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc # wipefs -f /dev/sdc # mount /dev/sdb -o degraded,rw If using btrfs-debug-tree to check /dev/sdb, one should find that the data chunk is only in sdb, so in fact it should allow degraded mount. This patchset will introduce a new per-chunk degradable check for btrfs, allow above case to succeed, and it's quite small anyway. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copied text from cover letter with more details about the problem being solved ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69f03f13 |
|
11-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Prevent possible ERR_PTR() dereference In btrfs_full_stripe_len/btrfs_is_parity_mirror we have similar code which gets the chunk map for a particular range via get_chunk_map. However, get_chunk_map can return an ERR_PTR value and while the 2 callers do catch this with a WARN_ON they then proceed to indiscriminately dereference the extent map. This of course leads to a crash. Fix the offenders by making the dereference conditional on IS_ERR. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f148ef4d |
|
27-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Be explicit about usage of min() __btrfs_alloc_chunk contains code which boils down to: ndevs = min(ndevs, devs_max) It's conditional upon devs_max not being 0. However, it cannot really be 0 since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use min explicitly. This has no functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5600fd6 |
|
27-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Use explicit round_down call rather than open-coding it No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebcc9301 |
|
27-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: convert while loop to list_for_each_entry No functional changes, just make the loop a bit more readable Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0e4324a4 |
|
21-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: round down size diff when shrinking/growing device Further testing showed that the fix introduced in 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") was insufficient and it could still lead to discrepancies between the total_bytes in the super block and the device total bytes. So this patch also ensures that the difference between old/new sizes when shrinking/growing is also rounded down. This ensure that we won't be subtracting/adding a non-sectorsize multiples to the superblock/device total sizees. Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc98a42c |
|
17-Jul-2017 |
David Howells <dhowells@redhat.com> |
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
e0ae9994 |
|
06-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: preallocate device flush bio For devices that support flushing, we allocate a bio, submit, wait for it and then free it. The bio allocation does not fail so ENOMEM is not a problem but we still may unnecessarily stress the allocation subsystem. Instead, we can allocate the bio at the same time we allocate the device and reuse it each time we need to flush the barriers. The bio is reset before each use. Reference counting is simplified to just device allocation (get) and freeing (put). The bio used to be submitted through the integrity checker which will find out that bio has no data attached and call submit_bio. Status of the bio in flight needs to be tracked separately in case the device caches get switched off between write and wait. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7dfb8be1 |
|
16-Jun-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Round down values which are written for total_bytes_size We got an internal report about a file system not wanting to mount following 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544 Subtracting the numbers we get a difference of less than a 4kb. Upon closer inspection it became apparent that mkfs actually rounds down the size of the device to a multiple of sector size. However, the same cannot be said for various functions which modify the total size and are called from btrfs_balance as well as when adding a new device. So this patch ensures that values being saved into on-disk data structures are always rounded down to a multiple of sectorsize. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d0c71b3 |
|
14-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: obsolete and remove mount option alloc_start The mount option alloc_start was used in the past for debugging and stressing the chunk allocator. Not meant to be used by users, so we're not breaking anybody's setup. There was some added complexity handling changes of the value and when it was not same as default. Such code has likely been untested and I think it's better to remove it. This patch kills all use of alloc_start, and by doing that also fixes a bug when alloc_size is set, potentially called from statfs: in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU protection is temporarily dropped so btrfs_account_dev_extents_size can be called and then RCU is locked again! Doing that inside list_for_each_entry_rcu is just asking for trouble, but unlikely to be observed in practice. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6165572c |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev The function is called from ioctl context and we don't hold any locks that take part in writeback. Right now it's only fs_info::volume_mutex. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b6c1d56 |
|
02-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: sink gfp parameter to btrfs_bio_clone All callers pass GFP_NOFS. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3aa8e074 |
|
02-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: btrfs_bio_clone never fails, skip error handling Update direct callers of btrfs_bio_clone that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b86826d |
|
17-May-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: cleanup root usage by btrfs_get_alloc_profile There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a5ed45f8 |
|
11-May-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Convert fs_info->free_chunk_space to atomic64_t The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e4cbee9 |
|
03-Jun-2017 |
Christoph Hellwig <hch@lst.de> |
block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
9bcaaea7 |
|
04-May-2017 |
Chris Mason <clm@fb.com> |
btrfs: fix the gfp_mask for the reada_zones radix tree Commits cc8385b59e17 and 7ef70b4d9987a7 added preallocation for the reada radix trees and also switched them over to GFP_KERNEL for the default gfp mask. Since we're doing radix tree insertions under spinlocks, we need to make sure the mask doesn't allow sleeping. This fix keeps the radix preallocation but switches back to the original gfp_mask. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e884f4f0 |
|
04-Apr-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use q which is already obtained from bdev_get_queue We have already assigned q from bdev_get_queue() so use it. And rearrange the code for better view. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42c61ab6 |
|
03-Apr-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: switch to div64_u64 if with a u64 divisor This is fixing code pieces where we use div_u64 when passing a u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
825ad4c9 |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: drop redundant parameters from btrfs_map_sblock All callers pass 0 for mirror_num and 1 for need_raid_map. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
171938e5 |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: track exclusive filesystem operation in flags There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces: * atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...) * atomic_set(..., 0) -> clear_bit(FLAG, ...) Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cc8385b5 |
|
02-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: preallocate radix tree node for readahead We can preallocate the node so insertion does not have to do that under the lock. The GFP flags for the per-device radix tree are initialized to GFP_NOFS & ~__GFP_DIRECT_RECLAIM but we can use GFP_KERNEL, same as an allocation above anyway, but also because readahead is optional and not on any critical writeout path. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
539b50d2 |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: convert BUG_ON to WARN_ON These two BUG_ON()s would never be true, ensured by callers' logic. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b19a1fe |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: helper for ops that requires full stripe This adds a helper to show directly whether ops require full stripe. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fad823f |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: do not add extra mirror when dev_replace target dev is not available With this, we can avoid allocating memory for dev replace copies if the target dev is not available. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73c0f228 |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: handle operations for device replace separately Since this part is mostly independent, this moves it to a separate function. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ab56090 |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: introduce a function to get extra mirror from replace As the part of getting extra mirror in __btrfs_map_block is independent, this puts it into a separate function. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b3d4cd3 |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: separate DISCARD from __btrfs_map_block Since DISCARD is not as important as an operation like write, we don't copy it to target device during replace, and it makes __btrfs_map_block less complex. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
592d92ee |
|
14-Mar-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: create a helper for getting chunk map We have similar code here and there, this merges them into a helper. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
490b54d6 |
|
03-Mar-2017 |
Elena Reshetova <elena.reshetova@intel.com> |
btrfs: convert extent_map.refs from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
140475ae |
|
03-Mar-2017 |
Elena Reshetova <elena.reshetova@intel.com> |
btrfs: convert btrfs_bio.refs from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14506127 |
|
07-Mar-2017 |
Adam Borowski <kilobyte@angband.pl> |
btrfs: fix a bogus warning when converting only data or metadata If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs balance -dconvert=raid1", the meta.target field will be uninitialized. That's otherwise ok, as it's unused except for this warning. Thus, let's use the existing set of raid levels for the comparison. As a side effect, non-convert balances will now nag about data>metadata. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a967efb3 |
|
10-Apr-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix potential use-after-free for cloned bio KASAN reports that there is a use-after-free case of bio in btrfs_map_bio. If we need to submit IOs to several disks at a time, the original bio would get cloned and mapped to the destination disk, but we really should use the original bio instead of a cloned bio to do the sanity check because cloned bios are likely to be freed by its endio. Reported-by: Diego <diegocg@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fa252992 |
|
15-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: handle allocation error in update_dev_stat_item Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da353f6b |
|
14-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: constify device path passed to relevant helpers Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4a4dce7 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from init_first_rw_device The 'device' used to be added in that function, but now it's done by the caller. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72b468c8 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from __btrfs_alloc_chunk We grab fs_info from other parameters. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efa7c9f9 |
|
02-Feb-2017 |
Jan Kara <jack@suse.cz> |
block: Get rid of blk_get_backing_dev_info() blk_get_backing_dev_info() is now a simple dereference. Remove that function and simplify some code around that. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
34441361 |
|
04-Oct-2016 |
David Sterba <dsterba@suse.com> |
btrfs: opencode chunk locking, remove helpers The helpers are trivial and we don't use them consistently. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3a45bb20 |
|
09-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: remove root parameter from transaction commit/end routines Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ff7e61e |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b246afa |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3796d335 |
|
16-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, lock/unlock_chunks Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da17066c |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb456252 |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5112febb |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_init_new_device should use fs_info->dev_root btrfs_init_new_device only uses the root passed in via the ioctl to start the transaction. Nothing else that happens is related to whatever root the user used to initiate the ioctl. We can drop the root requirement and just use fs_info->dev_root instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6bccf3ab |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: call functions that always use the same root with fs_info instead There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5b4aacef |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: call functions that overwrite their root parameter with fs_info There are 11 functions that accept a root parameter and immediately overwrite it. We can pass those an fs_info pointer instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b159fa28 |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove constant parameter to memset_extent_buffer and rename it The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d24ee97b |
|
09-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: use new helpers to set uuids in eb Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf8cddd3 |
|
27-Oct-2016 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't abuse REQ_OP_* flags for btrfs_map_block btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70fd7614 |
|
01-Nov-2016 |
Christoph Hellwig <hch@lst.de> |
block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
67f055c7 |
|
01-Nov-2016 |
Christoph Hellwig <hch@lst.de> |
btrfs: use op_is_sync to check for synchronous requests Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
19c4d2f9 |
|
10-Oct-2016 |
Chris Mason <clm@fb.com> |
Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" This reverts commit 5d8eb6fe517583f9c6d5b94faf2254a0207a45c9. When we remove devices, we free the device structures. Delaying btfs_remove_chunk() ends up hitting a use-after-free on them. Signed-off-by: Chris Mason <clm@fb.com>
|
#
0ccd0528 |
|
21-Sep-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix a possible umount deadlock btrfs_show_devname() is using the device_list_mutex, sometimes a call to blkdev_put() leads vfs calling into this func. So call blkdev_put() outside of device_list_mutex, as of now. [ 983.284212] ====================================================== [ 983.290401] [ INFO: possible circular locking dependency detected ] [ 983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted [ 983.302081] ------------------------------------------------------- [ 983.308357] umount/21720 is trying to acquire lock: [ 983.313243] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.321264] [ 983.321264] but task is already holding lock: [ 983.327101] (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.337839] [ 983.337839] which lock already depends on the new lock. [ 983.337839] [ 983.346024] [ 983.346024] the existing dependency chain (in reverse order) is: [ 983.353512] -> #4 (&fs_devs->device_list_mutex){+.+...}: [ 983.359096] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.365143] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.371521] [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs] [ 983.378710] [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150 [ 983.384593] [<ffffffff9126ffc7>] m_show+0x17/0x20 [ 983.389957] [<ffffffff91276405>] seq_read+0x2b5/0x3b0 [ 983.395669] [<ffffffff9124c808>] __vfs_read+0x28/0x100 [ 983.401464] [<ffffffff9124eb3b>] vfs_read+0xab/0x150 [ 983.407080] [<ffffffff9124ec32>] SyS_read+0x52/0xb0 [ 983.412609] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.419617] -> #3 (namespace_sem){++++++}: [ 983.424024] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.430074] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.435785] [<ffffffff91272457>] lock_mount+0x67/0x1c0 [ 983.441582] [<ffffffff91272ab2>] do_add_mount+0x32/0xf0 [ 983.447458] [<ffffffff9127363a>] finish_automount+0x5a/0xc0 [ 983.453682] [<ffffffff91259513>] follow_managed+0x1b3/0x2a0 [ 983.459912] [<ffffffff9125b750>] lookup_fast+0x300/0x350 [ 983.465875] [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0 [ 983.471846] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.477731] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.483702] [<ffffffff9124c4de>] SyS_open+0x1e/0x20 [ 983.489240] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.496254] -> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}: [ 983.501798] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.507855] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.513558] [<ffffffff91366237>] start_creating+0x87/0x100 [ 983.519703] [<ffffffff91366647>] debugfs_create_dir+0x17/0x100 [ 983.526195] [<ffffffff911df153>] bdi_register+0x93/0x210 [ 983.532165] [<ffffffff911df313>] bdi_register_owner+0x43/0x70 [ 983.538570] [<ffffffff914080fb>] device_add_disk+0x1fb/0x450 [ 983.544888] [<ffffffff91580226>] loop_add+0x1e6/0x290 [ 983.550596] [<ffffffff91fec358>] loop_init+0x10b/0x14f [ 983.556394] [<ffffffff91002207>] do_one_initcall+0xa7/0x180 [ 983.562618] [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266 [ 983.569370] [<ffffffff918174be>] kernel_init+0xe/0x100 [ 983.575166] [<ffffffff9182620f>] ret_from_fork+0x1f/0x40 [ 983.581131] -> #1 (loop_index_mutex){+.+.+.}: [ 983.585801] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.591858] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.598256] [<ffffffff9157ed3f>] lo_open+0x1f/0x60 [ 983.603704] [<ffffffff9128eec3>] __blkdev_get+0x123/0x400 [ 983.609757] [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350 [ 983.615639] [<ffffffff9128f554>] blkdev_open+0x64/0x80 [ 983.621428] [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0 [ 983.627651] [<ffffffff9124c029>] vfs_open+0x69/0x80 [ 983.633181] [<ffffffff9125db74>] path_openat+0x834/0xaa0 [ 983.639152] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.645035] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.650999] [<ffffffff9124c4de>] SyS_open+0x1e/0x20 [ 983.656535] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.663541] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 983.668107] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.674510] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.680561] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.686967] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.692761] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.699699] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.707178] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 983.714380] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 983.721061] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 983.727908] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 983.734744] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 983.740888] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 983.747909] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 983.754745] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 983.760977] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 983.766773] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 983.772738] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 983.778708] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 983.785373] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 983.792212] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 [ 983.799225] [ 983.799225] other info that might help us debug this: [ 983.799225] [ 983.807291] Chain exists of: &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex [ 983.816521] Possible unsafe locking scenario: [ 983.816521] [ 983.822489] CPU0 CPU1 [ 983.827043] ---- ---- [ 983.831599] lock(&fs_devs->device_list_mutex); [ 983.836289] lock(namespace_sem); [ 983.842268] lock(&fs_devs->device_list_mutex); [ 983.849478] lock(&bdev->bd_mutex); [ 983.853127] [ 983.853127] *** DEADLOCK *** [ 983.853127] [ 983.859113] 3 locks held by umount/21720: [ 983.863145] #0: (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70 [ 983.872713] #1: (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs] [ 983.882206] #2: (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.893422] [ 983.893422] stack backtrace: [ 983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1 [ 983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015 [ 983.913492] 0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0 [ 983.921018] ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050 [ 983.928542] ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0 [ 983.936072] Call Trace: [ 983.938545] [<ffffffff91429521>] dump_stack+0x85/0xc4 [ 983.943715] [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c [ 983.949748] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.955613] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.961123] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.966550] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.972407] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.977832] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.983101] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.989500] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.996415] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 984.003068] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 984.009189] [<ffffffff9126cc5e>] ? evict_inodes+0x15e/0x170 [ 984.014881] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 984.021176] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 984.027476] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 984.033082] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 984.039548] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 984.045839] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 984.051525] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 984.056774] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 984.062201] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 984.067625] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 984.073747] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 984.080038] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ab8d0fc4 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert pr_* to btrfs_* where possible For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62e85577 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d163e0e |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cea67ab9 |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: clean the old superblocks before freeing the device btrfs_rm_device frees the block device but then re-opens it using the saved device name. A race exists between the close and the re-open that allows the block size to be changed. The result is getting stuck forever in the reclaim loop in __getblk_slow. This patch moves the superblock cleanup before closing the block device, which is also consistent with other callers. We also don't need a private copy of dev_name as the whole routine operates under the uuid_mutex. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
afcdd129 |
|
02-Sep-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d8eb6fe |
|
02-Sep-2016 |
Naohiro Aota <naohiro.aota@hgst.com> |
btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But the work can be done by btrfs_delete_unused_bgs() (and it's better since it trim the BG). Let's dedupe the code. While btrfs_delete_unused_bgs() is already hitting the relocated BG, it skip the BG since the BG has "ro" flag set (to keep balancing BG intact). On the other hand, btrfs cannot drop "ro" flag here to prevent additional writes. So this patch make use of "removed" flag. btrfs_delete_unused_bgs() now detect the flag to distinguish whether a read-only BG is relocating or not. Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14238819 |
|
21-Jul-2016 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: do not background blkdev_put() At the end of unmount/dev-delete, if the device exclusive open is not actually closed, then there might be a race with another program in the userland who is trying to open the device in exclusive mode and it may fail for eg: unmount /btrfs; fsck /dev/x btrfs dev del /dev/x /btrfs; fsck /dev/x so here background blkdev_put() is not a choice Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1eff9d32 |
|
05-Aug-2016 |
Jens Axboe <axboe@fb.com> |
block: rename bio bi_rw to bi_opf Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
66642832 |
|
10-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_abort_transaction, drop root parameter __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
05f9a780 |
|
24-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction In btrfs_relocate_chunk, we get a transaction handle via btrfs_start_trans_remove_block_group, which starts the transaction using the extent root. When we call btrfs_end_transaction, we're calling it using the chunk root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14a1e067 |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: introduce BTRFS_MAX_ITEM_SIZE We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the same. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cdde224 |
|
09-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5a488b9d |
|
12-Jul-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix unexpected balance crash due to BUG_ON Mounting a btrfs can resume previous balance operations asynchronously. An user got a crash when one drive has some corrupt sectors. Since balance can cancel itself in case of any error, we can gracefully return errors to upper layers and let balance do the cancel job. Reported-by: sash <master.b.at.raven@chefmail.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e2bf6e89 |
|
23-Jun-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: make sure device is synced before return An inconsistent behavior due to stale reads from the disk was reported mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html This patch will make sure devices are synced before return in the unmount thread. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f448341a |
|
14-Jun-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: reorg btrfs_close_one_device() Moves closer to the caller and removes declaration Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
df5c82a8 |
|
15-Jul-2016 |
Vincent Stehlé <vincent.stehle@intel.com> |
Btrfs: fix comparison in __btrfs_map_block() Add missing comparison to op in expression, which was forgotten when doing the REQ_OP transition. Fixes: b3d3fa519905 ("btrfs: update __btrfs_map_block for REQ_OP transition") Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
b7f67055 |
|
23-Jun-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: Force stripesize to the value of sectorsize Btrfs code currently assumes stripesize to be same as sectorsize. However Btrfs-progs (until commit df05c7ed455f519e6e15e46196392e4757257305) has been setting btrfs_super_block->stripesize to a value of 4096. This commit makes sure that the value of btrfs_super_block->stripesize is a power of 2. Later, it unconditionally sets btrfs_root->stripesize to sectorsize. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c871b0f2 |
|
06-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: check if extent buffer is aligned to sectorsize Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
81a75f67 |
|
05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
btrfs: use bio fields for op and flags The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is no need to pass around the rq_flag_bits bits too. btrfs users should should access the bio insead. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
b3d3fa51 |
|
05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
btrfs: update __btrfs_map_block for REQ_OP transition We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block. It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS, so this drops the bit tests. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
37226b21 |
|
05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
btrfs: use bio op accessors This should be the easier cases to convert btrfs to bio_set_op_attrs/bio_op. They are mostly just cut and replace type of changes. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
4e49ea4a |
|
05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
block/fs/drivers: remove rw argument from submit_bio This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
e06cd3dd |
|
03-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: add validadtion checks for chunk loading To prevent fuzzed filesystem images from panic the whole system, we need various validation checks to refuse to mount such an image if btrfs finds any invalid value during loading chunks, including both sys_array and regular chunks. Note that these checks may not be sufficient to cover all corner cases, feel free to add more checks. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
99e3ecfc |
|
03-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: add more validation checks for superblock This adds validation checks for super_total_bytes, super_bytes_used and super_stripesize, super_num_devices. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d865177a |
|
03-Jun-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: clear uptodate flags of pages in sys_array eb We set uptodate flag to pages in the temporary sys_array eb, but do not clear the flag after free eb. As the special btree inode may still hold a reference on those pages, the uptodate flag can remain alive in them. If btrfs_super_chunk_root has been intentionally changed to the offset of this sys_array eb, reading chunk_root will read content of sys_array and it will skip our beautiful checks in btree_readpage_end_io_hook() because of "pages of eb are uptodate => eb is uptodate" This adds the 'clear uptodate' part to force it to read from disk. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
65d4f4c1 |
|
23-Sep-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: end transaction if we abort when creating uuid root We still need to call btrfs_end_transaction if we call btrfs_abort_transaction, otherwise we hang and make me super grumpy. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
22ab04e8 |
|
18-May-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between device replace and chunk allocation While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time. However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start dellaloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater then or equals to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device. For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout: [ extent bg A ] [ hole, unallocated space ] [extent bg B ] 3Gb 4Gb 5Gb While we are iterating the device extents from the source device using the commit root of the device tree, the following happens: CPU 1 CPU 2 <we are at transaction N> scrub_enumerate_chunks() --> searches the device tree for extents belonging to the source device using the device tree's commit root --> 1st iteration finds extent belonging to block group A --> sets block group A to RO mode (btrfs_inc_block_group_ro) --> sets cursor left to found_key.offset which is 3Gb --> scrub_chunk() starts copies all allocated extents from block group's A stripe at source device into target device btrfs_alloc_chunk() --> allocates device extent in the range [4Gb, 5Gb[ from the source device for a new block group C extent allocated from block group C for a direct IO, buffered write or btree node/leaf extent is written to, perhaps in response to a writepages() call from the VM or directly through direct IO the write is made only against the source device and not against the target device because the extent's offset is in the interval [4Gb, 5Gb[ which is larger then the value of cursor_left (3Gb) --> scrub_chunks() finishes --> updates left cursor from 3Gb to 4Gb --> btrfs_dec_block_group_ro() sets block group A back to RW mode <we are still at transaction N> --> 2nd iteration finds extent belonging to block group B - it did not find the new extent in the range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
57ba4cb8 |
|
19-May-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between device replace and block group removal When it's finishing, the device replace code iterates all extent maps representing block group and for each one that has a stripe that refers to the source device, it replaces its device with the target device. However when it replaces the source device with the target device it, the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID), only after its ID is changed to match the one from the source device. This leads to races with the chunk removal code that can temporarly see a device with an ID of 0ULL and then attempt to use that ID to remove items from the device tree and fail, causing a transaction abort: [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished [ 9238.594377] ------------[ cut here ]------------ [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594403] BTRFS: Transaction aborted (error 1) [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl oppy [last unloaded: btrfs] [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0 [ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18 [ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0 [ 9238.594424] Call Trace: [ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd [ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53 [ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188 [ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs] [ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs] [ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs] [ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc [ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 [ 9238.594496] ---[ end trace 183efbe50275f059 ]--- The sequence of steps leading to this is like the following: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_delete_unused_bgs() btrfs_remove_chunk() looks up for the extent map corresponding to the chunk lock_chunks() (chunk_mutex) check_system_chunk() unlock_chunks() (chunk_mutex) locks fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) iterates over all stripes from the extent map --> calls btrfs_free_dev_extent() passing it the target device that still has an ID of 0ULL --> btrfs_free_dev_extent() fails --> aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid (which is necessarily > 0) frees the srcdev unlocks fs_info->chunk_mutex So fix this by taking the device list mutex while processing the stripes for the chunk's extent map. This is similar to the race between device replace and block group creation that was fixed by commit 50460e37186a ("Btrfs: fix race when finishing dev replace leading to transaction abort"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
|
#
01327610 |
|
19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c8b5b6e |
|
13-May-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: free sys_array eb as soon as possible While reading sys_chunk_array in superblock, btrfs creates a temporary extent buffer. Since we don't use it after finishing reading sys_chunk_array, we don't need to keep it in memory. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8da4b8c4 |
|
20-May-2016 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
lib/uuid.c: move generate_random_uuid() to uuid.c Let's gather the UUID related functions under one hood. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
48b3b9d4 |
|
12-Apr-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix lock dep warning move scratch super outside of chunk_mutex Move scratch super outside of the chunk lock to avoid below lockdep warning. The better place to scratch super is in the function btrfs_rm_dev_replace_free_srcdev() just before free_device, which is outside of the chunk lock as well. To reproduce: (fresh boot) mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde mount /dev/sdc /btrfs dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 (get devmgt from https://github.com/asj/devmgt.git) devmgt detach /dev/sde dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 sync btrfs replace start -Brf 3 /dev/sdf /btrfs <-- devmgt attach host7 ====================================================== [ INFO: possible circular locking dependency detected ] 4.6.0-rc2asj+ #1 Not tainted --------------------------------------------------- btrfs/2174 is trying to acquire lock: (sb_writers){.+.+.+}, at: [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 but task is already holding lock: (&fs_info->chunk_mutex){+.+.+.}, at: [<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs] which lock already depends on the new lock. Chain exists of: sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(sb_writers); *** DEADLOCK *** -> #0 (sb_writers){.+.+.+}: [<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0 [<ffffffff810e707e>] lock_acquire+0xbe/0x210 [<ffffffff810df49a>] percpu_down_read+0x4a/0xa0 [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 [<ffffffff81265534>] mnt_want_write+0x24/0x50 [<ffffffff812508a2>] path_openat+0x952/0x1190 [<ffffffff81252451>] do_filp_open+0x91/0x100 [<ffffffff8123f5cc>] file_open_name+0xfc/0x140 [<ffffffff8123f643>] filp_open+0x33/0x60 [<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs] [<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs] [<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs] [<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs] [<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs] Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e042d1ec |
|
11-Apr-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: remove BUG_ON()'s in btrfs_map_block btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree to not be assholes and panic at any possible chance things are all fucked up. Signed-off-by: Josef Bacik <jbacik@fb.com> [ removed type casts ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d8da678 |
|
26-Apr-2016 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix divide error upon chunk's stripe_len The struct 'map_lookup' uses type int for @stripe_len, while btrfs_chunk_stripe_len() can return a u64 value, and it may end up with @stripe_len being undefined value and it can lead to 'divide error' in __btrfs_map_block(). This changes 'map_lookup' to use type u64 for stripe_len, also right now we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for BTRFS_STRIPE_LEN. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ folded division fix to scrub_raid56_parity ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
779bf3fe |
|
18-Apr-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex When the replace target fails, the target device will be taken out of fs device list, scratch + update_dev_time and freed. However we could do the scratch + update_dev_time and free part after the device has been taken out of device list, so that we don't have to hold the device_list_mutex and uuid_mutex locks. Reported issue: [ 5375.718845] ====================================================== [ 5375.718846] [ INFO: possible circular locking dependency detected ] [ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted [ 5375.718849] ------------------------------------------------------- [ 5375.718851] btrfs-health/4662 is trying to acquire lock: [ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.718862] [ 5375.718862] but task is already holding lock: [ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs] [ 5375.718907] [ 5375.718907] which lock already depends on the new lock. [ 5375.718907] [ 5375.718908] [ 5375.718908] the existing dependency chain (in reverse order) is: [ 5375.718911] [ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}: [ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0 [ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs] [ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150 [ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20 [ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0 [ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0 [ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130 [ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0 [ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.718968] [ 5375.718968] -> #2 (namespace_sem){+++++.}: [ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80 [ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0 [ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0 [ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30 [ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0 [ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.718991] [ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}: [ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0 [ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360 [ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210 [ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20 [ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.719015] [ 5375.719015] -> #0 (sb_writers){.+.+.+}: [ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0 [ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0 [ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50 [ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360 [ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130 [ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60 [ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs] [ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs] [ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs] [ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs] [ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs] [ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs] [ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs] [ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110 [ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70 [ 5375.719230] [ 5375.719230] other info that might help us debug this: [ 5375.719230] [ 5375.719233] Chain exists of: [ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex [ 5375.719233] [ 5375.719234] Possible unsafe locking scenario: [ 5375.719234] [ 5375.719234] CPU0 CPU1 [ 5375.719235] ---- ---- [ 5375.719236] lock(&fs_devs->device_list_mutex); [ 5375.719238] lock(namespace_sem); [ 5375.719239] lock(&fs_devs->device_list_mutex); [ 5375.719241] lock(sb_writers); [ 5375.719241] [ 5375.719241] *** DEADLOCK *** [ 5375.719241] [ 5375.719243] 4 locks held by btrfs-health/4662: [ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs] [ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs] [ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs] [ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs] [ 5375.719343] [ 5375.719343] stack backtrace: [ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40 [ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015 [ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0 [ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930 [ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004 [ 5375.719357] Call Trace: [ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2 [ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260 [ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0 [ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0 [ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0 [ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0 [ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50 [ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360 [ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs] [ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80 [ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0 [ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120 [ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130 [ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60 [ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs] [ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs] [ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs] [ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs] [ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs] [ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs] [ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs] [ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs] [ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs] [ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs] [ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110 [ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70 [ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 5375.719697] ------------[ cut here ]------------ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
88be159c |
|
23-Mar-2016 |
Austin S. Hemmelgarn <ahferroin7@gmail.com> |
btrfs: allow balancing to dup with multi-device Currently, we don't allow the user to try and rebalance to a dup profile on a multi-device filesystem. In most cases, this is a perfectly sensible restriction as raid1 uses the same amount of space and provides better protection. However, when reshaping a multi-device filesystem down to a single device filesystem, this requires the user to convert metadata and system chunks to single profile before deleting devices, and then convert again to dup, which leaves a period of time where metadata integrity is reduced. This patch removes the single-device-only restriction from converting to dup profile to remove this potential data integrity reduction. Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
88acff64 |
|
03-May-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup assigning next active device with a check Creates helper fucntion as needed by the device delete and replace operations. Also now it checks if the next device being assigned is an active device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ed01abe |
|
14-Apr-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: s_bdev is not null after missing replace Yauhen reported in the ML that s_bdev is null at mount, and s_bdev gets updated to some device when missing device is replaced, as because bdev is null for missing device, things gets matched up. Fix this by checking if s_bdev is set. I didn't want to completely remove updating s_bdev because the future multi device support at vfs layer may need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c5c0df0 |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_find_device_by_user_input For clarity how we are going to find the device, let's call it a device specifier, devspec for short. Also rename the arguments that are a leftover from previous function purpose. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
418775a2 |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: use existing device constraints table btrfs_raid_array We should avoid duplicating the device constraints, let's use the btrfs_raid_array in btrfs_check_raid_min_devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
621292ba |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: introduce raid-type to error-code table, for minimum device constraint Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cc31a0d |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: pass number of devices to btrfs_check_raid_min_devices Before this patch, btrfs_check_raid_min_devices would do an off-by-one check of the constraints and not the miminmum check, as its name suggests. This is not a problem if the only caller is device remove, but would be confusing for others. Add an argument with the exact number and let the caller(s) decide if this needs any adjustments, like when device replace is running. Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f47ab258 |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: rename __check_raid_min_devices Underscores are for special functions, use the full prefix for better stacktrace recognition. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
02feae3c |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: optimize check for stale device Optimize check for stale device to only be checked when there is device added or changed. If there is no update to the device, there is no need to call btrfs_free_stale_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b526ed7 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: introduce device delete by devid This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as an user argument. The patch won't delete the old ioctl interface and so kernel remains backward compatible with user land progs. Test case/script: echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk mount /dev/sdd /btrfs dmsetup suspend bad_disk echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk dmsetup resume bad_disk echo "bad disk failed. now deleting/replacing" btrfs dev del 3 /btrfs echo $? btrfs fi show /btrfs umount /btrfs btrfs-show-super /dev/sdd | egrep num_device dmsetup remove bad_disk wipefs -a /dev/sdf Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Martin <m_btrfs@ml1.co.uk> [ adjust messages, s/disk/device/ ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42b67427 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device() With the previous patches now the btrfs_scratch_superblocks() is ready to be used in btrfs_rm_device() so use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3d1b153 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: enhance btrfs_find_device_by_user_input() to check device path The operation of device replace and device delete follows same steps upto some depth with in btrfs kernel, however they don't share codes. This enhancement will help replace and delete to share codes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
24fc572f |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: make use of btrfs_find_device_by_user_input() btrfs_rm_device() has a section of the code which can be replaced btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
24e0474b |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: create helper btrfs_find_device_by_user_input() The patch renames btrfs_dev_replace_find_srcdev() to btrfs_find_device_by_user_input() and moves it to volumes.c, so that delete device can use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd45ffbc |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: clean up and optimize __check_raid_min_device() __check_raid_min_device() which was pealed from btrfs_rm_device() maintianed its original code to show the block move. This patch cleans up __check_raid_min_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f1fa7f26 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: create helper function __check_raid_min_devices() move a section of btrfs_rm_device() code to check for min number of the devices into the function __check_raid_min_devices() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6cf86a00 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: create a helper function to read the disk super A part of code from btrfs_scan_one_device() is moved to a new function btrfs_read_disk_super(), so that former function looks cleaner. (In this process it also moves the code which ensures null terminating label). So this creates easy opportunity to merge various duplicate codes on read disk super. Earlier attempt to merge duplicate codes highlighted that there were some issues for which there are duplicate codes (to read disk super), however it was not clear what was the issue. So until we figure that out, its better to keep them in a separate functions. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cf25ce51 |
|
14-Dec-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: do not create empty block group if we have allocated data Now we force to create empty block group to keep data profile alive, however, in the below example, we eventually get an empty block group while we're trying to get more space for other types (metadata/system), - Before, block group "A": size=2G, used=1.2G block group "B": size=2G, used=512M - After "btrfs balance start -dusage=50 mount_point", block group "A": size=2G, used=(1.2+0.5)G block group "C": size=2G, used=0 Since there is no data in block group C, it won't be deleted automatically and we have to get the unused 2G until the next mount. Balance itself just moves data and doesn't remove data, so it's safe to not create such a empty block group if we already have data allocated in other block groups. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34d97007 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_std_error to btrfs_handle_fs_error btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
09cbfeaf |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bb7ab3b9 |
|
04-Mar-2016 |
Adam Buchbinder <adam.buchbinder@gmail.com> |
btrfs: Fix misspellings in comments. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
73beece9 |
|
17-Jul-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix lockdep deadlock warning due to dev_replace Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
242e2956 |
|
25-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: switch dev stats item to the permanent item key Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c479cb4f |
|
25-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: switch balance item to the temporary item key No visible change. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
78f2c9e6 |
|
11-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: device add and remove: use GFP_KERNEL We can safely use GFP_KERNEL in the functions called from the ioctl handlers. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e410e34f |
|
29-Jan-2016 |
Chris Mason <clm@fb.com> |
Revert "btrfs: synchronize incompat feature bits with sysfs files" This reverts commit 14e46e04958df740c6c6a94849f176159a333f13. This ends up doing sysfs operations from deep in balance (where we should be GFP_NOFS) and under heavy balance load, we're making races against sysfs internals. Revert it for now while we figure things out. Signed-off-by: Chris Mason <clm@fb.com>
|
#
14e46e04 |
|
21-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: synchronize incompat feature bits with sysfs files The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount (eg. the LZO compression). Synchronize the feature bits with the sysfs files representing them right after we set/clear them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad1ba2a0 |
|
15-Dec-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Use direct way to determine raid56 write/recover mode Old code used bbio->raid_map to determine whether in raid56 write/recover operation, because we didn't't have bbio->map_type. Now we have direct way for this condition, rid of using the function-relative data, and make the code more readable. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
94a97dfe |
|
09-Dec-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Small cleanup for get index_srcdev loop 1: Adjust condition in loop to make less TAB 2: Move btrfs_put_bbio()'s line for combine, and makes logic clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f04b772b |
|
14-Dec-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Enhance chunk validation check Enhance chunk validation: 1) Num_stripes We already have such check but it's only in super block sys chunk array. Now check all on-disk chunks. 2) Chunk logical It should be aligned to sector size. This behavior should be *DOUBLE CHECKED* for 64K sector size like PPC64 or AArch64. Maybe we can found some hidden bugs. 3) Chunk length Same as chunk logical, should be aligned to sector size. 4) Stripe length It should be power of 2. 5) Chunk type Any bit out of TYPE_MAS | PROFILE_MASK is invalid. With all these much restrict rules, several fuzzed image reported in mail list should no longer cause kernel panic. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fedc0045 |
|
15-Jan-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix typo in log message when starting a balance The recent change titled "Btrfs: Check metadata redundancy on balance" (already in linux-next) left a typo in a message for users: metatdata -> metadata. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fb75d857 |
|
18-Jan-2016 |
Colin Ian King <colin.king@canonical.com> |
btrfs: remove duplicate const specifier duplicate const is redundant so remove it Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
546bed63 |
|
15-Jan-2016 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
btrfs: initialize the seq counter in struct btrfs_device I managed to trigger this: | INFO: trying to register non-static key. | the code is fine but needs lockdep annotation. | turning off the locking correctness validator. | CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14 | Hardware name: ARM-Versatile Express | [<80307cec>] (dump_stack) | [<80070e98>] (__lock_acquire) | [<8007184c>] (lock_acquire) | [<80287800>] (btrfs_ioctl) | [<8012a8d4>] (do_vfs_ioctl) | [<8012ac14>] (SyS_ioctl) so I think that btrfs_device_data_ordered_init() is not invoked behind a macro somewhere. Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95617d69 |
|
03-Jun-2015 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: cleanup, stop casting for extent_map->lookup everywhere Overloading extent_map->bdev to struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8cdc7c5b |
|
06-Jan-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix fitrim discarding device area reserved for boot loader's use As of the 4.3 kernel release, the fitrim ioctl can now discard any region of a disk that is not allocated to any chunk/block group, including the first megabyte which is used for our primary superblock and by the boot loader (grub for example). Fix this by not allowing to trim/discard any region in the device starting with an offset not greater than min(alloc_start_mount_option, 1Mb), just as it was not possible before 4.3. A reproducer test case for xfstests follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are # reserved for a boot loader to use (GRUB for example) and btrfs should never # use them - neither for allocating metadata/data nor should trim/discard them. # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem. $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io # Now mount the filesystem and perform a fitrim against it. _scratch_mount _require_batched_discard $SCRATCH_MNT $FSTRIM_PROG $SCRATCH_MNT # Now unmount the filesystem and verify the content of the ranges was not # modified (no trim/discard happened on them). _scratch_unmount echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:" od -t x1 -N $((64 * 1024)) $SCRATCH_DEV od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV status=0 exit Reported-by: Vincent Petry <PVince81@yahoo.fr> Reported-by: Andrei Borzenkov <arvidjaar@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341 Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM) Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
ee592d07 |
|
06-Jan-2016 |
Sam Tygier <samtygier@yahoo.co.uk> |
Btrfs: Check metadata redundancy on balance When converting a filesystem via balance check that metadata mode is at least as redundant as the data mode. For example give warning when: -dconvert=raid1 -mconvert=single Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk> [ minor message reformatting ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4058b54 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup, use enum values for btrfs_path reada Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee22184b |
|
14-Dec-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7928d672 |
|
30-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup, remove stray return statements Signed-off-by: David Sterba <dsterba@suse.com>
|
#
93a3d467 |
|
30-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: verbose error when we find an unexpected item in sys_array Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f5cdedd7 |
|
30-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: handle invalid num_stripes in sys_array We can handle the special case of num_stripes == 0 directly inside btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to catch other unhandled cases where we fail to validate external data. A crafted or corrupted image crashes at mount time: BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0 BTRFS info (device loop0): disk space caching is enabled BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()! Kernel panic - not syncing: BUG! CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25 Stack: 637af890 60062489 602aeb2e 604192ba 60387961 00000011 637af8a0 6038a835 637af9c0 6038776b 634ef32b 00000000 Call Trace: [<6001c86d>] show_stack+0xfe/0x15b [<6038a835>] dump_stack+0x2a/0x2c [<6038776b>] panic+0x13e/0x2b3 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff [<601cfbbe>] open_ctree+0x192d/0x27af [<6019c2c1>] btrfs_mount+0x8f5/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<6019bcb0>] btrfs_mount+0x2e4/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<600d710b>] do_mount+0xa35/0xbc9 [<600d7557>] SyS_mount+0x95/0xc8 [<6001e884>] handle_syscall+0x6b/0x8e Reported-by: Jiri Slaby <jslaby@suse.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> CC: stable@vger.kernel.org # 3.19+ Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5ca8781 |
|
19-Nov-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Support convert to -d dup for btrfs-convert Since we will add support for -d dup for non-mixed filesystem, kernel need to support converting to this raid-type. This patch remove limitation of above case. Tested by following script: (combination of dup conversion with fsck): export TEST_DEV='/dev/vdc' export TEST_DIR='/var/ltf/tester/mnt' do_dup_test() { local m_from="$1" local d_from="$2" local m_to="$3" local d_to="$4" echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to" umount "$TEST_DIR" &>/dev/null ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1 mount "$TEST_DEV" "$TEST_DIR" || return 1 cp -a /sbin/* "$TEST_DIR" [[ "$m_from" != "$m_to" ]] && { ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1 } [[ "$d_from" != "$d_to" ]] && { local opt=() [[ "$d_to" == single ]] && opt+=("-f") ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1 } umount "$TEST_DIR" || return 1 ./btrfsck "$TEST_DEV" || return 1 echo return 0 } test_all() { for m_from in single dup; do for d_from in single dup; do for m_to in single dup; do for d_to in single dup; do do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1 done done done done } test_all Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
140e639f |
|
23-Dec-2015 |
Chris Mason <clm@fb.com> |
btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc map->num_stripes really can't be zero, but just in case. Signed-off-by: Chris Mason <clm@fb.com>
|
#
50460e37 |
|
20-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race when finishing dev replace leading to transaction abort During the final phase of a device replace operation, I ran into a transaction abort that resulted in the following trace: [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]() [23919.664742] BTRFS: Transaction aborted (error -2) [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs] [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1 [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98 [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48 [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0 [23919.696716] Call Trace: [23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs] [23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs] [23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs] [23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs] [23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194 [23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78 [23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff [23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b [23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6 [23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [23919.733330] ---[ end trace 166ef301a335832a ]--- This is due to a race between device replace and chunk allocation, which the following diagram illustrates: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_fallocate() btrfs_alloc_data_chunk_ondemand() btrfs_join_transaction() --> starts a new transaction do_chunk_alloc() lock fs_info->chunk_mutex btrfs_alloc_chunk() --> creates extent map for the new chunk with em->bdev->map->stripes[i]->dev->devid == X (X > 0) --> extent map is added to fs_info->mapping_tree --> initial phase of bg A allocation completes unlock fs_info->chunk_mutex lock fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) btrfs_end_transaction() btrfs_create_pending_block_groups() --> starts final phase of bg A creation (update device, extent, and chunk trees, etc) btrfs_finish_chunk_alloc() btrfs_update_device() --> attempts to update a device item with ID == 0ULL (BTRFS_DEV_REPLACE_DEVID) which is the current ID of bg A's em->bdev->map->stripes[i]->dev->devid --> doesn't find such item returns -ENOENT --> the device id should have been X and not 0ULL got -ENOENT from btrfs_finish_chunk_alloc() and aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid, which is X (and X > 0) frees the srcdev unlock fs_info->chunk_mutex So fix this by taking the device list mutex when processing the chunk's extent map stripes to update the device items. This avoids getting the wrong device id and use-after-free problems if the task finishing a chunk allocation grabs the replaced device, which is freed while the dev replace task is holding the device list mutex. This happened while running fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
|
#
8a7d656f |
|
10-Dec-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix transaction handle leak in balance If we fail to allocate a new data chunk, we were jumping to the error path without release the transaction handle we got before. Fix this by always releasing it before doing the jump. Fixes: 2c9fe8355258 ("btrfs: Fix lost-data-profile caused by balance bg") Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
4db8c528 |
|
03-Dec-2015 |
David Sterba <dsterba@suse.com> |
btrfs: remove a trivial helper btrfs_set_buffer_uptodate Signed-off-by: David Sterba <dsterba@suse.com>
|
#
87ad58c5 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: make btrfs_close_one_device static Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dba72cb3 |
|
16-Nov-2015 |
Holger Hoffstätte <holger.hoffstaette@googlemail.com> |
btrfs: fix balance range usage filters in 4.4-rc There's a regression in 4.4-rc since commit bc3094673f22 (btrfs: extend balance filter usage to take minimum and maximum) in that existing (non-ranged) balance with -dusage=x no longer works; all chunks are skipped. After staring at the code for a while and wondering why a non-ranged balance would even need min and max thresholds (..which then were not set correctly, leading to the bug) I realized that the only problem was the fact that the filter functions were named wrong, thanks to patching copypasta. Simply renaming both functions lets the existing btrfs-progs call balance with -dusage=x and now the non-ranged filter function is invoked, properly using only a single chunk limit. Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
31388ab2 |
|
19-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: fix rcu warning during device replace The test btrfs/011 triggers a rcu warning Reviewed-by: Anand Jain <anand.jain@oracle.com> =============================== [ INFO: suspicious RCU usage. ] 4.4.0-rc1-default+ #286 Tainted: G W ------------------------------- fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by btrfs/28786: 0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs] 1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs] 2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs] 3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs] stack backtrace: CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010 0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001 0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78 ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220 Call Trace: [<ffffffff8141d47b>] dump_stack+0x4f/0x74 [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140 [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs] [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80 [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs] [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs] [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs] [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs] [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs] [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0 [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs] [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0 [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0 [<ffffffff811e3336>] ? __fget_light+0x86/0xb0 [<ffffffff811e3369>] ? __fdget+0x9/0x20 [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80 [<ffffffff811d7483>] SyS_ioctl+0x53/0x80 [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f This is because of unprotected use of rcu_dereference in btrfs_scratch_superblocks. We can't add rcu locks around the whole function because we read the superblock. The fix will use the rcu string buffer directly without the rcu locking. Thi is safe as the device will not go away in the meantime. We're holding the device list mutexes. Restructuring the code to narrow down the rcu section turned out to be impossible, we need to call filp_open (through update_dev_time) on the buffer and this could call kmalloc/__might_sleep. We could call kstrdup with GFP_ATOMIC but it's not absolutely necessary. Fixes: 12b1c2637b6e (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks) Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7fd01182 |
|
13-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix the number of transaction units needed to remove a block group We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8eab77ff |
|
13-Nov-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: use global reserve when deleting unused block group after ENOSPC It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2c9fe835 |
|
08-Nov-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix lost-data-profile caused by balance bg Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs balance start -dusage=0 $TEST_DIR btrfs filesystem usage $TEST_DIR dd if=/dev/zero of="$TEST_DIR"/file count=100 btrfs filesystem usage $TEST_DIR Result: We can see "no data chunk" in first "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 1.07GiB And "data chunks changed from raid1 to single" in second "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Data,single: Size:256.00MiB, Used:0.00B /dev/vdh 256.00MiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 841.92MiB Reason: btrfs balance delete last data chunk in case of no data in the filesystem, then we can see "no data chunk" by "fi usage" command. And when we do write operation to fs, the only available data profile is 0x0, result is all new chunks are allocated single type. Fix: Allocate a data chunk explicitly to ensure we don't lose the raid profile for data. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d0164adc |
|
06-Nov-2015 |
Mel Gorman <mgorman@techsingularity.net> |
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
bc309467 |
|
20-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: extend balance filter usage to take minimum and maximum Similar to the 'limit' filter, we can enhance the 'usage' filter to accept a range. The change is backward compatible, the range is applied only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag. We don't have a usecase yet, the current syntax has been sufficient. The enhancement should provide parity with other range-like filters. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
dee32d0a |
|
28-Sep-2015 |
GabrÃel Arthúr Pétursson <gabriel@system.is> |
btrfs: add balance filter for stripes Balance block groups which have the given number of stripes, defined by a range min..max. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. Signed-off-by: GabrÃel Arthúr Pétursson <gabriel@system.is> [ renamed bargs members, added to the UAPI, wrote the changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
12907fc7 |
|
10-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: extend balance filter limit to take minimum and maximum The 'limit' filter is underdesigned, it should have been a range for [min,max], with some relaxed semantics when one of the bounds is missing. Besides that, using a full u64 for a single value is a waste of bytes. Let's fix both by extending the use of the u64 bytes for the [min,max] range. This can be done in a backward compatible way, the range will be interpreted only if the appropriate flag is set (BTRFS_BALANCE_ARGS_LIMIT_RANGE). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3204d33c |
|
24-Sep-2015 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add a flags field to btrfs_transaction I want to set some per transaction flags, so instead of adding yet another int lets just convert the current two int indicators to flags and add a flags field for future use. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
bdcd3c97 |
|
22-Sep-2015 |
Alexandru Moise <00moses.alexander00@gmail.com> |
btrfs: cleanup btrfs_balance profile validity checks Improve readability by generalizing the profile validity checks. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8789f4fe |
|
15-Sep-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures() btrfs_raid_array[] is used to define all raid attributes, use it to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(), instead of complex condition in function. It can make code simple and auto-support other possible raid-type in future. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af902047 |
|
15-Sep-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Move btrfs_raid_array to public This array is used to record attributes of each raid type, make it public, and many functions will benifit with this array. For example, num_tolerated_disk_barrier_failures(), we can avoid complex conditions in this function, and get raid attribute simply by accessing above array. It can also make code logic simple, reduce duplication code, and increase maintainability. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee863954 |
|
16-Feb-2015 |
David Sterba <dsterba@suse.com> |
btrfs: comment the rest of implicit barriers before waitqueue_active There are atomic operations that imply the barrier for waitqueue_active mixed in an if-condition. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f14d104d |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: switch more printks to our helpers Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b14af3b4 |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: switch message printers to ratelimited _in_rcu variants Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ecaeb14b |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: switch message printers to _in_rcu variants Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f190aa47 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: add helper for closing one device Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
097efc96 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: don't log error from btrfs_get_bdev_and_sb Originally the message was not in a helper but ended up there. We should print error messages from callers instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12b1c263 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks This patch updates and renames btrfs_scratch_superblocks, (which is used by the replace device thread), with those fixes from the scratch superblock code section of btrfs_rm_device(). The fixes are: Scratch all copies of superblock Notify kobject that superblock has been changed Update time on the device So that btrfs_rm_device() can use the function btrfs_scratch_superblocks() instead of its own scratch code. And further replace deivce code which similarly releases device back to the system, will have the fixes from the btrfs device delete. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed to btrfs_scratch_superblock] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d74a6259 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: use BTRFS_ERROR_DEV_MISSING_NOT_FOUND when missing device is not found Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead of -ENOENT. Next this removes the logging when user specifies "missing" and we don't find it in the kernel device list. Logging are for system events not for user input errors. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4553fef |
|
25-Sep-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: consolidate btrfs_error() to btrfs_std_error() btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92fc03fb |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: SB read failure should return EIO for __bread failure This will return EIO when __bread() fails to read SB, instead of EINVAL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1b7e474 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename super_kobj to fsid_kobj Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32576040 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3bd6973 |
|
14-Aug-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
943c6e99 |
|
19-Aug-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance Code for updating fs_info->num_tolerated_disk_barrier_failures in btrfs_balance() lacks raid56 support. Reason: Above code was wroten in 2012-08-01, together with btrfs_calc_num_tolerated_disk_barrier_failures()'s first version. Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated later to support raid56, but code in btrfs_balance() was not updated together. Fix: Merge above similar code to a common function: btrfs_get_num_tolerated_disk_barrier_failures() and make it support both case. It can fix this bug with a bonus of cleanup, and make these code never in above no-sync state from now on. Suggested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3a9508b0 |
|
21-Aug-2015 |
Chris Mason <clm@fb.com> |
btrfs: fix compile when block cgroups are not enabled bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on. This adds an ifdef around them. It's not perfect, but our use of bi_ioc is being removed in the 4.3 merge window. The bi_css usage really should go into bio_clone, but I want to make sure that doesn't introduce problems for other bio_clone use cases. Signed-off-by: Chris Mason <clm@fb.com>
|
#
277fb5fc |
|
19-Aug-2015 |
Michal Hocko <mhocko@suse.com> |
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio alloc_btrfs_bio relies on GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can still fail if the _current_ process is the OOM killer victim. Moreover there is an attempt to move away from the default no-fail behavior and allow these allocation to fail more eagerly. This would lead to: [ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045 which is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0e28997e |
|
02-Dec-2013 |
Kent Overstreet <kent.overstreet@gmail.com> |
btrfs: remove bio splitting and merge_bvec_fn() calls Btrfs has been doing bio splitting from btrfs_map_bio(), by checking device limits as well as calling ->merge_bvec_fn() etc. That is not necessary any more, because generic_make_request() is now able to handle arbitrarily sized bios. So clean up unnecessary code paths. Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
da2f0f74 |
|
02-Jul-2015 |
Chris Mason <clm@fb.com> |
Btrfs: add support for blkio controllers This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>
|
#
dc2ee4e2 |
|
05-Aug-2015 |
Zhaolei <zhaolei@cn.fujitsu.com> |
btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk() Remove chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary, it can also cleanup some code in caller for prepare its value. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
55e3a601 |
|
05-Aug-2015 |
Zhaolei <zhaolei@cn.fujitsu.com> |
btrfs: Fix data checksum error cause by replace with io-load. xfstests btrfs/070 sometimes failed. In my test machine, its fail rate is about 30%. In another vm(vmware), its fail rate is about 50%. Reason: btrfs/070 do replace and defrag with fsstress simultaneously, after above operation, checksum error is found by scrub. Actually, it have no relationship with defrag operation, only replace with fsstress can trigger this bug. New data writen to target device have possibility rewrited by old data from source device by replace code in debug, to avoid above problem, we can set target block group to readonly in replace period, so new data requested by other operation will not write to same place with replace code. Before patch(4.1-rc3): 30% failed in 100 xfstests. After patch: 0% failed in 300 xfstests. It also happened in btrfs/071 as it's another scrub with IO load tests. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
499f377f |
|
15-Jun-2015 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: iterate over unused chunk space in FITRIM Since we now clean up block groups automatically as they become empty, iterating over block groups is no longer sufficient to discard unused space. This patch iterates over the unused chunk space and discards any regions that are unallocated, regardless of whether they were ever used. This is a change for btrfs but is consistent with other file systems. We do this in a transactionless manner since the discard process can take a substantial amount of time and a transaction would need to be started before the acquisition of the device list lock. That would mean a transaction would be held open across /all/ of the discards collectively. In order to prevent other threads from allocating or freeing chunks, we hold the chunks lock across the search and discard calls. We release it between searches to allow the file system to perform more-or-less normally. Since the running transaction can commit and disappear while we're using the transaction pointer, we take a reference to it and release it after the search. This is safe since it would happen normally at the end of the transaction commit after any locks are released anyway. We also take the commit_root_sem to protect against a transaction starting and committing while we're running. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4246a0b6 |
|
20-Jul-2015 |
Christoph Hellwig <hch@lst.de> |
block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
67c5e7d4 |
|
10-Jun-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between balance and unused block group deletion We have a race between deleting an unused block group and balancing the same block group that leads to an assertion failure/BUG(), producing the following trace: [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622 [181631.220591] ------------[ cut here ]------------ [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062! [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$ [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1 [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000 [181631.224566] RIP: 0010:[<ffffffffa03f19f6>] [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs] [181631.224566] RSP: 0018:ffff8800b5827ae8 EFLAGS: 00010246 [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000 [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200 [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000 [181631.224566] FS: 00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000 [181631.224566] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0 [181631.224566] Stack: [181631.224566] ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e [181631.224566] ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001 [181631.224566] ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000 [181631.224566] Call Trace: [181631.224566] [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs] [181631.224566] [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs] [181631.224566] [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs] [181631.224566] [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs] [181631.224566] [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs] [181631.224566] [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs] [181631.224566] [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf [181631.224566] [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs] [181631.224566] [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15 [181631.224566] [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc [181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2 [181631.224566] [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2 [181631.224566] [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424 [181631.224566] [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479 (...) The sequence of steps leading to this are: CPU 0 CPU 1 btrfs_balance() btrfs_relocate_chunk() btrfs_relocate_block_group(bg X) btrfs_lookup_block_group(bg X) cleaner_kthread locks fs_info->cleaner_mutex btrfs_delete_unused_bgs() finds bg X, which became unused in the previous transaction checks bg X ->ro == 0, so it proceeds sets bg X ->ro to 1 (btrfs_set_block_group_ro(bg X)) blocks on fs_info->cleaner_mutex btrfs_remove_chunk(bg X) unlocks fs_info->cleaner_mutex acquires fs_info->cleaner_mutex relocate_block_group() --> does nothing, no extents found in the extent tree from bg X unlocks fs_info->cleaner_mutex btrfs_relocate_block_group(bg X) returns btrfs_remove_chunk(bg X) extent map not found --> ASSERT(0) Fix this by using a new mutex to make sure these 2 operations, block group relocation and removal, are serialized. This issue is reproducible by running fstests generic/038 (which stresses chunk allocation and automatic removal of unused block groups) together with the following balance loop: while true; do btrfs balance start -dusage=0 <mountpoint> ; done Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
65f53338 |
|
08-Jun-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: cleanup noused initialization of dev in btrfs_end_bio() It is introduced by: c404e0dc2c843b154f9a36c3aec10d0a715d88eb Btrfs: fix use-after-free in the finishing procedure of the device replace But seems no relationship with that bug, this patch revirt these code block for cleanup. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d2ff1b20 |
|
09-Mar-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: sysfs: add support to show replacing target in the sysfs This patch will add support to show the replacing target in sysfs during the process of replacement. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
4fde46f0 |
|
17-Jun-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: free the stale device When btrfs on a device is overwritten with a new btrfs (mkfs), the old btrfs instance in the kernel becomes stale. So with this patch, if kernel finds device is overwritten then delete the stale fsid/uuid. Signed-off-by: Anand Jain <anand.jain@oracle.com>
|
#
4617ea3a |
|
09-Jun-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix necessary chunk tree space calculation when allocating a chunk When allocating a new chunk or removing one we need to update num_devs device items and insert or remove a chunk item in the chunk tree, so in the worst case the space needed in the chunk space_info is: btrfs_calc_trunc_metadata_size(chunk_root, num_devs) + btrfs_calc_trans_metadata_size(chunk_root, 1) That is, in the worst case we need to cow num_devs paths and cow 1 other path that can result in splitting every node and leaf, and each path consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes, which were unnecessary since updating the existing device items does not result in splitting the nodes and leaf since after updating them they remain with the same size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
39c2d7fa |
|
20-May-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix -ENOSPC on block group removal Unlike when attempting to allocate a new block group, where we check that we have enough space in the system space_info to update the device items and insert a new chunk item in the chunk tree, we were not checking if the system space_info had enough space for updating the device items and deleting the chunk item in the chunk tree. This often lead to -ENOSPC error when attempting to allocate blocks for the chunk tree (during btree node/leaf COW operations) while updating the device items or deleting the chunk item, which resulted in the current transaction being aborted and turning the filesystem into read-only mode. While running fstests generic/038, which stresses allocation of block groups and removal of unused block groups, with a large scratch device (750Gb) this happened often, despite more than enough unallocated space, and resulted in the following trace: [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [68663.600407] BTRFS: Transaction aborted (error -28) (...) [68663.730829] Call Trace: [68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs] [68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs] [68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs] [68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs] [68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c [68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs] [68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs] [68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7 [68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90 [68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.849503] ---[ end trace 798477c6d6dbaad6 ]--- [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left So fix this by verifying that enough space exists in system space_info, and reserving the space in the chunk block reserve, before attempting to delete the block group and allocate a new system chunk if we don't have enough space to perform the necessary updates and delete in the chunk tree. Like for the block group creation case, we don't error our if we fail to allocate a new system chunk, since we might end up not needing it (no node/leaf splits happen during the COW operations and/or we end up not needing to COW any btree nodes or leafs because they were already COWed in the current transaction and their writeback didn't start yet). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c152b63e |
|
14-May-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix chunk allocation regression leading to transaction abort With commit 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") introduced in the kernel 4.1 merge window, we end up using part of a device hole for which there are already pending chunks or pinned chunks. Before that commit we didn't use the hole and would just move on to the next hole in the device. However when we adjust the start offset for the chunk allocation and we have pinned chunks, we set it blindly to the end offset of the pinned chunk we are currently processing, which is dangerous because we can have a pending chunk that has a start offset that matches the end offset of our pinned chunk - leading us to a case where we end up getting two pending chunks that start at the same physical device offset, which makes us later abort the current transaction with -EEXIST when finishing the chunk allocation at btrfs_create_pending_block_groups(): [194737.659017] ------------[ cut here ]------------ [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [194737.662209] BTRFS: Transaction aborted (error -17) [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [194737.682999] 0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2 [194737.684540] ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78 [194737.686017] ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40 [194737.687509] Call Trace: [194737.688068] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [194737.689027] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [194737.690095] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [194737.691198] [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.693789] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [194737.695065] [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.696806] [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [194737.698683] [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs] [194737.700329] [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs] [194737.701924] [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [194737.703675] [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs] [194737.705417] [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [194737.707058] [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [194737.708560] [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs] [194737.710673] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [194737.712076] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [194737.713293] [<ffffffff81153b58>] vfs_write+0xb2/0x117 [194737.714443] [<ffffffff81154424>] SyS_pwrite64+0x64/0x82 [194737.715646] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]--- [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by btrfs_create_pending_block_groups(), when it attempts to insert a duplicated device extent item via btrfs_alloc_dev_extent(). This issue was reproducible with fstests generic/038 running in a loop for several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard". Applying Jeff's recent patch titled "btrfs: add missing discards when unpinning extents with -o discard" makes the issue much easier to reproduce (usually within 4 to 5 hours), since it pins chunks for longer periods of time when an unused block group is deleted by the cleaner kthread. Fix this by making sure that we never adjust the start offset to a lower value than it currently has. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2037a093 |
|
12-May-2015 |
Sasha Levin <sasha.levin@oracle.com> |
btrfs: use after free when closing devices __btrfs_close_devices() would call_rcu to free the device, which is racy with list_for_each_entry() accessing the memory to retrieve the next device on the list. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
33b97e43 |
|
07-May-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: check error before reporting missing device and add uuid Report missing device when add is successful, otherwise it would exit as ENOMEM. And add uuid to the report. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
816fcebe |
|
26-Apr-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: log when missing device is created Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6d13f549 |
|
24-Apr-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: fix warnings after changes in btrfs_abort_transaction fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
53e489bc |
|
02-Jun-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: check pending chunks when shrinking fs to avoid corruption When we shrink the usable size of a device (its total_bytes), we go over all the device extent items in the device tree and attempt to relocate the chunk of any device extent that goes beyond the new usable size for the device. We do that after setting the new usable size (total_bytes) in the device object, so that all new allocations (and reallocations) don't use areas of the device that go beyond the new (shorter) size. However we were not considering that before setting the new size in the device, pending chunks might have been created that use device extents that go beyond the new size, and those device extents are not yet in the device tree after we search the device tree - they are still attached to the list of new block group for some ongoing transaction handle, and they are only added to the device tree when the transaction handle is ended (via btrfs_create_pending_block_groups()). So check for pending chunks with device extents that go beyond the new size and if any exists, commit the current transaction and repeat the search in the device tree. Not doing this it would mean we would return success to user space while still having extents that go beyond the new size, and later user space could override those locations on the device while the fs still references them, causing all sorts of corruption and unexpected events. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2421a8cd |
|
09-Mar-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
6c14a164 |
|
09-Mar-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info since btrfs_kobj_rm_device() does nothing with fs_info Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
1ba43816 |
|
09-Mar-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info btrfs_kobj_add_device() does not need fs_info any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
5a13f430 |
|
09-Mar-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: sysfs: add pointer to access fs_info from fs_devices adds fs_info pointer with struct btrfs_fs_devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
c73eccf7 |
|
09-Mar-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
2e7910d6 |
|
09-Mar-2015 |
Anand Jain <Anand.Jain@oracle.com> |
Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices This patch will provide a framework and help to create attributes from the structure btrfs_fs_devices which are available even before fs_info is created. So by moving the parent kobject super_kobj from fs_info to btrfs_fs_devices, it will help to create attributes from the btrfs_fs_devices as well. Patches on top of this patch now will be able to create the sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices when devices are scanned and registered to the kernel. Just to note, this does not change any of the existing btrfs sysfs external kobject names and its attributes and not even the life cycle of them. Changes are internal only. And to ensure the same, this path has been tested with various device operations and, checking and comparing the sysfs kobjects and attributes with sysfs kobject and attributes with out this patch, and they remain same. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
326e1dbb |
|
22-May-2015 |
Mike Snitzer <snitzer@redhat.com> |
block: remove management of bi_remaining when restoring original bi_end_io Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
a9629596 |
|
18-May-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix racy system chunk allocation when setting block group ro If while setting a block group read-only we end up allocating a system chunk, through check_system_chunk(), we were not doing it while holding the chunk mutex which is a problem if a concurrent chunk allocation is happening, through do_chunk_alloc(), as it means both block groups can end up using the same logical addresses and physical regions in the device(s). So make sure we hold the chunk mutex. Cc: stable@vger.kernel.org # 4.0+ Fixes: 2f0810880f08 ("btrfs: delete chunk allocation attemp when setting block group ro") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
dac56212 |
|
17-Apr-2015 |
Jens Axboe <axboe@fb.com> |
bio: skip atomic inc/dec of ->bi_cnt for most use cases Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
1b984508 |
|
09-Feb-2015 |
Forrest Liu <forrestl@synology.com> |
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole If device tree has hole, find_free_dev_extent() cannot find available address properly. The problem can be reproduce by following script. mntpath=/btrfs loopdev=/dev/loop0 filepath=/home/forrest/image umount $mntpath losetup -d $loopdev truncate --size 100g $filepath losetup $loopdev $filepath mkfs.btrfs -f $loopdev mount $loopdev $mntpath # make device tree with one big hole for i in `seq 1 1 100`; do fallocate -l 1g $mntpath/$i done sync for i in `seq 1 1 95`; do rm $mntpath/$i done sync # wait cleaner thread remove unused block group sleep 300 fallocate -l 1g $mntpath/aaa # failed to allocate new chunk fallocate -l 1g $mntpath/bbb Above script will make device tree with one big hole, and can only allocate just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 106292051968 length 1073741824 item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 103108575232 length 1073741824 Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f2ab7618 |
|
16-Feb-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix tail space processing in find_free_dev_extent() It is another reason for NO_SPACE case. When we found enough free space in loop and saved them to max_hole_start/size before, and tail space contains pending extent, origional innocent max_hole_start/size are reset in retry. As a result, find_free_dev_extent() returns less space than it can, and cause NO_SPACE in user program. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
258ece02 |
|
24-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: remove shadowing variables in __btrfs_map_block 1) We can safely use the function's 'i'. Fixes warning fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here 2) A local variable duplicates name of an argument, we can use the value directly. Fixes warning fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
9d644a62 |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup, use correct type in div_u64_rem div_u64_rem expects u32 for divisior and reminder. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
47c5713f |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: replace remaining do_div calls with div_u64 variants Switch to div_u64_rem that does type checking and has more obvious semantics than do_div. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b8b93add |
|
16-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup 64bit/32bit divs, provably bounded values The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
31e818fe |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup, use kmalloc_array/kcalloc array helpers Convert kmalloc(nr * size, ..) to kmalloc_array that does additional overflow checks, the zeroing variant is kcalloc. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
853d8ec4 |
|
08-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: need_resched not needed with cond_resched Cleanup, no special reason to do if (need_resched()) cond_resched(); Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
29cf342b |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup, use correct type in div_u64_rem div_u64_rem expects u32 for divisior and reminder. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
35b850f1 |
|
20-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: replace remaining do_div calls with div_u64 variants Switch to div_u64_rem that does type checking and has more obvious semantics than do_div. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
c7abe829 |
|
16-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup 64bit/32bit divs, provably bounded values The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e57cf21e |
|
19-Feb-2015 |
Chris Mason <clm@fb.com> |
Btrfs: fix allocation size calculations in alloc_btrfs_bio Since commit 8e5cfb55d3f (Btrfs: Make raid_map array be inlined in btrfs_bio structure), the raid map array is allocated along with the btrfs bio in alloc_btrfs_bio. The calculation used to decide how much we need to allocate was using the wrong parameter passed into the allocation function. The passed in real_stripes will be zero if a target replace operation is not currently running. We want to use total_stripes instead. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: David Sterba <dsterba@suse.cz> Tested-by: David Sterba <dsterba@suse.cz>
|
#
9eaed21e |
|
01-Aug-2014 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: remove unused fs_info arg from btrfs_close_extra_devices() The commit: 8dabb74 Btrfs: change core code of btrfs to support the device replace operations added the fs_info argument, but never used it - just remove it again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
08da757d |
|
12-Feb-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: cleanup: use for() loop in btrfs_map_bio() for() is obviously better in these code block, and remove noused init-value to reduce about 6 bytes binary size. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
a688a04a |
|
09-Feb-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: remove unused chunk_tree argument in several functions There functions include unused chunk_tree argument from the begining, it is time to remove them and clean up relative code to prepare value of this argument in caller. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e8c9f186 |
|
02-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: constify structs with op functions or static definitions There are some op tables that can be easily made const, similarly the sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
13212b54 |
|
11-Feb-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
btrfs: Fix out-of-space bug Btrfs will report NO_SPACE when we create and remove files for several times, and we can't write to filesystem until mount it again. Steps to reproduce: 1: Create a single-dev btrfs fs with default option 2: Write a file into it to take up most fs space 3: Delete above file 4: Wait about 100s to let chunk removed 5: goto 2 Script is like following: #!/bin/bash # Recommend 1.2G space, too large disk will make test slow DEV="/dev/sda16" MNT="/mnt/tmp" dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024)) echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev" for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done echo "mkfs $DEV" mkfs.btrfs -f "$DEV" >/dev/null || exit 2 echo "mount $DEV $MNT" mount "$DEV" "$MNT" || exit 2 for ((loop_i = 0; loop_i < 20; loop_i++)); do echo echo "loop $loop_i" echo "dd file..." cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m") "${cmd[@]}" 2>/dev/null || { # NO_SPACE error triggered echo "dd failed: ${cmd[*]}" exit 1 } echo "rm file..." rm -f "$MNT"/file0 || exit 2 for ((i = 0; i < 10; i++)); do df "$MNT" | tail -1 sleep 10 done done Reason: It is triggered by commit: 47ab2a6c689913db23ccae38349714edf8365e0a which is used to remove empty block groups automatically, but the reason is not in that patch. Code before works well because btrfs don't need to create and delete chunks so many times with high complexity. Above bug is caused by many reason, any of them can trigger it. Reason1: When we remove some continuous chunks but leave other chunks after, these disk space should be used by chunk-recreating, but in current code, only first create will successed. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason2: contains_pending_extent() return wrong value in calculation. Fixed by Forrest Liu <forrestl@synology.com> in: Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Reason3: btrfs_check_data_free_space() try to commit transaction and retry allocating chunk when the first allocating failed, but space_info->full is set in first allocating, and prevent second allocating in retry. Fixed in this patch by clear space_info->full in commit transaction. Tested for severial times by above script. Changelog v3->v4: use light weight int instead of atomic_t to record have_remove_bgs in transaction, suggested by: Josef Bacik <jbacik@fb.com> Changelog v2->v3: v2 fixed the bug by adding more commit-transaction, but we only need to reclaim space when we are really have no space for new chunk, noticed by: Filipe David Manana <fdmanana@gmail.com> Actually, our code already have this type of commit-and-retry, we only need to make it working with removed-bgs. v3 fixed the bug with above way. Changelog v1->v2: v1 will introduce a new bug when delete and create chunk in same disk space in same transaction, noticed by: Filipe David Manana <fdmanana@gmail.com> V2 fix this bug by commit transaction after remove block grops. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Suggested-by: Filipe David Manana <fdmanana@gmail.com> Suggested-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e3540eab |
|
05-Nov-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: add more checks to btrfs_read_sys_array Verify that the sys_array has enough bytes to read the next item. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1ffb22cf |
|
31-Oct-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup, rename a few variables in btrfs_read_sys_array There's a pointer to buffer, integer offset and offset passed as pointer, try to find matching names for them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ffe2d203 |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
10f11900 |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Include map_type in raid_bio Corrent code use many kinds of "clever" way to determine operation target's raid type, as: raid_map != NULL or raid_map[MAX_NR] == RAID[56]_Q_STRIPE To make code easy to maintenance, this patch put raid type into bbio, and we can always get raid type from bbio with a "stupid" way. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6e9606d2 |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: add ref_count and free function for btrfs_bio 1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8e5cfb55 |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Make raid_map array be inlined in btrfs_bio structure It can make code more simple and clear, we need not care about free bbio and raid_map together. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
cc7539ed |
|
20-Jan-2015 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: sort raid_map before adding tgtdev stripes It can avoid complex calculation of real stripes in sort, moreover, we can clean up code of sorting tgtdev_map because it will be in order initially. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
98af592f |
|
14-Dec-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: filp_open() returns ERR_PTR() on failure, not NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a83fffb7 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: sink blocksize parameter to btrfs_find_create_tree_block Finally it's clear that the requested blocksize is always equal to nodesize, with one exception, the superblock. Superblock has fixed size regardless of the metadata block size, but uses the same helpers to initialize sys array/chunk tree and to work with the chunk items. So it pretends to be an extent_buffer for a moment, btrfs_read_sys_array is full of special cases, we're adding one more. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
495e64f4 |
|
02-Dec-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix fs mapping extent map leak On chunk allocation error (label "error_del_extent"), after adding the extent map to the tree and to the pending chunks list, we would leave decrementing the extent map's refcount by 2 instead of 3 (our allocation + tree reference + list reference). Also, on chunk/block group removal, if the block group was on the list pending_chunks we weren't decrementing the respective list reference. Detected by 'rmmod btrfs': [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-1+ #1 [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [20770.106130] 0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040 [20770.106132] ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0 [20770.106134] ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78 [20770.106136] Call Trace: [20770.106142] [<ffffffff813e7a13>] dump_stack+0x45/0x56 [20770.106145] [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90 [20770.106164] [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs] [20770.106176] [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs] [20770.106179] [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4 [20770.106182] [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c [20770.106184] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b This applies on top (depends on) of my previous patch titled: "Btrfs: fix race between fs trimming and block group remove/allocation" But the issue in fact was already present before that change, it only became easier to hit after Josef's 3.18 patch that added automatic removal of empty block groups. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
04216820 |
|
27-Nov-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between fs trimming and block group remove/allocation Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4245215d |
|
25-Nov-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 The commit c404e0dc (Btrfs: fix use-after-free in the finishing procedure of the device replace) fixed a use-after-free problem which happened when removing the source device at the end of device replace, but at that time, btrfs didn't support device replace on raid56, so we didn't fix the problem on the raid56 profile. Currently, we implemented device replace for raid56, so we need kick that problem out before we enable that function for raid56. The fix method is very simple, we just increase the bio per-cpu counter before we submit a raid56 io, and decrease the counter when the raid56 io ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
2c8cdd6e |
|
14-Nov-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs, replace: write dirty pages into the replace target device The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except write operation on RAID56, all the other operation don't take the target device stripes into account. - When we do write operation, we read the data from the common devices and calculate the parity. Then write the dirty data and new parity out, at this time, we will find the relative replace target stripes and wirte the relative data into it. Note: The function that copying old data on the source device to the target device was implemented in the past, it is similar to the other RAID type. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
af8e2d1d |
|
23-Oct-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
6de65650 |
|
12-Nov-2014 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block stripe_index's value was set again in latter line: stripe_index = 0; Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz>
|
#
f90523d1 |
|
12-Nov-2014 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: remove noused bbio_ret in __btrfs_map_block in condition bbio_ret in this condition is always !NULL because previous code already have a check-and-skip: 4908 if (!bbio_ret) 4909 goto out; Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz>
|
#
084b6e7c |
|
30-Oct-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Fix a lockdep warning when running xfstest. The following lockdep warning is triggered during xfstests: [ 1702.980872] ========================================================= [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ] [ 1702.981482] 3.18.0-rc1 #27 Not tainted [ 1702.981781] --------------------------------------------------------- [ 1702.982095] kswapd0/77 just changed the state of lock: [ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs] [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1702.984675] other info that might help us debug this: [ 1702.985524] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1702.986799] Possible interrupt unsafe locking scenario: [ 1702.987681] CPU0 CPU1 [ 1702.988137] ---- ---- [ 1702.988598] lock(&fs_info->dev_replace.lock); [ 1702.989069] local_irq_disable(); [ 1702.989534] lock(&delayed_node->mutex); [ 1702.990038] lock(&found->groups_sem); [ 1702.990494] <Interrupt> [ 1702.990938] lock(&delayed_node->mutex); [ 1702.991407] *** DEADLOCK *** It is because the btrfs_kobj_{add/rm}_device() will call memory allocation with GFP_KERNEL, which may flush fs page cache to free space, waiting for it self to do the commit, causing the deadlock. To solve the problem, move btrfs_kobj_{add/rm}_device() out of the dev_replace lock range, also involing split the btrfs_rm_dev_replace_srcdev() function into remove and free parts. Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are called out of the lock range. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1b6e4469 |
|
24-Sep-2014 |
Fabian Frederick <fabf@skynet.be> |
Btrfs: fix compilation errors under DEBUG bi_sector and bi_size moved to bi_iter since commit 4f024f3797c4 ("block: Abstract out bvec iterator") Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>
|
#
47ab2a6c |
|
18-Sep-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: remove empty block groups automatically One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0f23ae74 |
|
18-Sep-2014 |
Chris Mason <clm@fb.com> |
Revert "Btrfs: device_list_add() should not update list when mounted" This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f. This commit is triggering failures to mount by subvolume id in some configurations. The main problem is how many different ways this scanning function is used, both for scanning while mounted and unmounted. A proper cleanup is too big for late rcs. For now, just revert the commit and we'll put a better fix into a later merge window. Signed-off-by: Chris Mason <clm@fb.com>
|
#
28e1cc7d |
|
12-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Set real mirror number for read operation on RAID0/5/6 We need real mirror number for RAID0/5/6 when reading data, or if read error happens, we would pass 0 as the number of the mirror on which the io error happens. It is wrong and would cause the filesystem read the data from the corrupted mirror again. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c3929c36 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: modify rw_devices counter under chunk_mutex context rw_devices counter is often used to tune the profile when doing chunk allocation, so we should modify it under the chunk_mutex context to avoid getting wrong chunk profile. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5f375835 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: move the missing device to its own fs device list For a missing device, we don't know it belong to which fs before we read its fsid from the chunk tree. So we add them into the current fs device list at first. When we get its fsid, we should move them to their own fs device list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
416d7b80 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs When we open a seed filesystem, if the degraded mount option is set, we continue to mount the fs if we don't find some devices in the seed filesystem. But we should stop mounting if other errors happen. Fix it Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
82372bc8 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make the logic of source device removing more clear Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
67a2c45e |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix use-after-free problem of the device during device replace The problem is: Task0(device scan task) Task1(device replace task) scan_one_device() mutex_lock(&uuid_mutex) device = find_device() mutex_lock(&device_list_mutex) lock_chunk() rm_and_free_source_device unlock_chunk() mutex_unlock(&device_list_mutex) check device Destroying the target device if device replace fails also has the same problem. We fix this problem by locking uuid_mutex during destroying source device or target device, just like the device remove operation. It is a temporary solution, we can fix this problem and make the code more clear by atomic counter in the future. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
adbbb863 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected device list access when cloning fs devices We can build a new filesystem based a seed filesystem, and we need clone the fs devices when we open the new filesystem. But someone might clear the seed flag of the seed filesystem, then mount that filesystem and remove some device. If we mount the new filesystem, we might access a device list which was being changed when we clone the fs devices. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2196d6e8 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Fix misuse of chunk mutex There were several problems about chunk mutex usage: - Lock chunk mutex when updating metadata. It would cause the nested deadlock because updating metadata might need allocate new chunks that need acquire chunk mutex. We remove chunk mutex at this case, because b-tree lock and other lock mechanism can help us. - ABBA deadlock occured between device_list_mutex and chunk_mutex. When we update device status, we must acquire device_list_mutex at the beginning, and then we might get chunk_mutex during the device status update because we need allocate new chunks for metadata COW. But at most place, we acquire chunk_mutex at first and then acquire device list mutex. We need change the lock order. - Some place we needn't acquire chunk_mutex. For example we needn't get chunk_mutex when we free a empty seed fs_devices structure. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fe48a5c0 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected system chunk array insertion We didn't protect the system chunk array when we added a new system chunk into it, it would cause the array be corrupted if someone remove/add some system chunk into array at the same time. Fix it by chunk lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7cc8e58d |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected device's variants on 32bits machine ->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk lock when we change them, but sometimes we read them without any lock, and we might get unexpected value. We fix this problem like inode's i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1c116187 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: update free_chunk_space during allocting a new chunk We should update free_chunk_space in time when we allocate a new chunk, not when we deal with the pending device update and block group insertion, because we need the real free_chunk_space data to calculate the reserved space, if we don't update it in time, we would consider the disk space which has be allocated as free space, and would use it to do overcommit reservation. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
43530c46 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected device->bytes_used update We should update device->bytes_used in the lock context of chunk_mutex, or we would get wrong data. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5d778aae |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Fix wrong free_chunk_space assignment during removing a device During removing a device, we have modified free_chunk_space when we shrink the device, so we needn't assign a new value to it after the device shrink. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ce7213c7 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong device bytes_used in the super block device->bytes_used will be changed when allocating a new chunk, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Though it is not big problem because we don't use it now, but anyway it is better that we make it be consistent with the common metadata, maybe we will use it in the future. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
935e5cc9 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong disk size when writing super blocks total_size will be changed when resizing a device, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1c43366d |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected assignment of the target device We didn't protect the assignment of the target device, it might cause the problem that the super block update was skipped because we might find wrong size of the target device during the assignment. Fix it by moving the assignment sentences into the initialization function of the target device. And there is another merit that we can check if the target device is suitable more early. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
90180da4 |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: cleanup unused num_can_discard in fs_devices The member variants - num_can_discard - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3c1dbdf5 |
|
19-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: rename total_bytes to avoid confusion we are assigning number_devices to the total_bytes, that's very confusing for a moment Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b2efedca |
|
13-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev() seed fs devices don't participate as rw_device, so don't increment rw_devices when the device being handled belongs to a seed fs. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8bef8401 |
|
13-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: fix memory leak when there is no more seed device When we replace all the seed device in the system there is no point in just keeping the btrfs_fs_devices with out any device Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
94d5f0c2 |
|
13-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: update sprout seed pointer when seed fs is relinquished We are not updating sprout fs seed pointer when all seed device is replaced. This patch will check if all seed device has been replaced and then update the sprout pointer accordingly. Same reproducer as in the previous patch would apply here. And notice that btrfs_close_device will check if seed fs is present and spits out the error with out this patch. int btrfs_close_devices(struct btrfs_fs_devices *fs_devices) { :: seed_devices = fs_devices->seed; :: while (seed_devices) { fs_devices = seed_devices; seed_devices = fs_devices->seed; __btrfs_close_devices(fs_devices); free_fs_devices(fs_devices); } Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
25e8e911 |
|
19-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: replace seed device followed by unmount causes kernel WARNING reproducer: mount /dev/sdb /btrfs btrfs dev add /dev/sdc /btrfs btrfs rep start -B /dev/sdb /dev/sdd /btrfs umount /btrfs WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]() :: __btrfs_close_devices() :: WARN_ON(fs_devices->open_devices); After the seed device has been replaced the new target device is no more a seed device. So we need to update the device numbers in the fs_devices as pointed by the fs_info. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d51908ce |
|
13-Aug-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: preparatory to make btrfs_rm_dev_replace_srcdev() seed aware There is no logical change in this patch, just a preparatory patch, so that changes can be easily reasoned. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f98de9b9 |
|
04-Aug-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make btrfs_search_forward return with nodes unlocked None of the uses of btrfs_search_forward() need to have the path nodes (level >= 1) read locked, only the leaf needs to be locked while the caller processes it. Therefore make it return a path with all nodes unlocked, except for the leaf. This change is motivated by the observation that during a file fsync we repeatdly call btrfs_search_forward() and process the returned leaf while upper nodes of the returned path (level >= 1) are read locked, which unnecessarily blocks other tasks that want to write to the same fs/subvol btree. Therefore instead of modifying the fsync code to unlock all nodes with level >= 1 immediately after calling btrfs_search_forward(), change btrfs_search_forward() to do it, so that it benefits all callers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
443f24fe |
|
23-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: cleanup unused latest_devid and latest_trans in fs_devices The member variants - latest_devid and latest_trans - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
addc3fa7 |
|
23-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Fix the problem that the dirty flag of dev stats is cleared The io error might happen during writing out the device stats, and the device stats information and dirty flag would be update at that time, but the current code didn't consider this case, just clear the dirty flag, it would cause that we forgot to write out the new device stats information. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
14586651 |
|
08-Jul-2014 |
HIMANGI SARAOGI <himangi774@gmail.com> |
Btrfs: use BUG_ON Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d20983b4 |
|
03-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix writing data into the seed filesystem If we mounted a seed filesystem with degraded option, and then added a new device into the seed filesystem, then we found adding device failed because of the IO failure. Steps to reproduce: # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1> # btrfstune -S 1 <dev0> # mount <dev0> -o degraded <mnt> # btrfs device add -f <dev2> <mnt> It is because the original didn't set the chunk on the seed device to be read-only if the degraded flag was set. It was introduced by patch f48b90756, which fixed the problem the raid1 filesystem became read-only after one device of it was missing. But this fix method was not right, we should set the read-only flag according to the number of the missing devices, not the degraded mount option, if the number of the missing devices is less than the max error number that the profile of the chunk tolerates, we don't set it to be read-only. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
962a298f |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: kill the key type accessor helpers btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9e0af237 |
|
15-Aug-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix task hang under heavy compressed write This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7df69d3e |
|
23-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Fix wrong device size when we are resizing the device total_bytes of device is just a in-memory variant which is used to record the size of the device, and it might be changed before we resize a device, if the resize operation fails, it will be fallbacked. But some code used it to update on-disk metadata of the device, it would cause the problem that on-disk metadata of the devices was not consistent. We should use the other variant named disk_total_bytes to update the on-disk metadata of device, because that variant is updated only when the resize operation is successful. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ff61d17c |
|
23-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: Fix the problem that the replace destroys the seed filesystem The seed filesystem was destroyed by the device replace, the reproduce method is: # mkfs.btrfs -f <dev0> # btrfstune -S 1 <dev0> # mount <dev0> <mnt> # btrfs device add <dev1> <mnt> # umount <mnt> # mount <dev1> <mnt> # btrfs replace start -f <dev0> <dev2> <mnt> # umount <mnt> # mount <dev0> <mnt> It is because we erase the super block on the seed device. It is wrong, we should not change anything on the seed device. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3a7d55c8 |
|
16-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong missing device counter decrease The missing devices are accounted by its own fs device, for example the missing devices in seed filesystem will be accounted by the fs device of the seed filesystem, not by the new filesystem which is based on the seed filesystem, so when we remove the missing device in the seed filesystem, we should decrease the counter of its own fs device. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
69611ac8 |
|
03-Jul-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs We forgot to zero some members in fs_devices when we create new fs_devices from the one of the seed fs. It would cause the problem that we got wrong chunk profile when allocating chunks. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
77bdae4d |
|
03-Jul-2014 |
Anand Jain <anand.jain@oracle.com> |
btrfs: check generation as replace duplicates devid+uuid When FS in unmounted we need to check generation number as well since devid+uuid combination could match with the missing replaced disk when it reappears, and without this patch it might pair with the replaced disk again. device_list_add() function is called in the following threads, mount device option mount argument ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan) ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>) they have been unit tested to work fine with this patch. If the user knows what he is doing and really want to pair with replaced disk (which is not a standard operation), then he should first clear the kernel btrfs device list in the memory by doing the module unload/load and followed with the mount -o device option. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b96de000 |
|
03-Jul-2014 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: device_list_add() should not update list when mounted device_list_add() is called when user runs btrfs dev scan, which would add any btrfs device into the btrfs_fs_devices list. Now think of a mounted btrfs. And a new device which contains the a SB from the mounted btrfs devices. In this situation when user runs btrfs dev scan, the current code would just replace existing device with the new device. Which is to note that old device is neither closed nor gracefully removed from the btrfs. The FS is still operational with the old bdev however the device name is the btrfs_device is new which is provided by the btrfs dev scan. reproducer: devmgt[1] detach /dev/sdc replace the missing disk /dev/sdc btrfs rep start -f 1 /dev/sde /btrfs Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120 Total devices 2 FS bytes used 32.00KiB devid 1 size 958.94MiB used 115.88MiB path /dev/sde devid 2 size 958.94MiB used 103.88MiB path /dev/sdd make /dev/sdc to reappear devmgt attach host2 btrfs dev scan btrfs fi show -m Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M Total devices 2 FS bytes used 32.00KiB^M devid 1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong. devid 2 size 958.94MiB used 103.88MiB path /dev/sdd since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be part of the btrfs-fsid when it reappears. If user want it to be part of it then sys admin should be using btrfs device add instead. [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0bfaa9c5 |
|
06-Jul-2014 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: test for valid bdev before kobj removal in btrfs_rm_device commit 99994cd btrfs: dev delete should remove sysfs entry added a btrfs_kobj_rm_device, which dereferences device->bdev... right after we check whether device->bdev might be NULL. I don't honestly know if it's possible to have a NULL device->bdev here, but assuming that it is (given the test), we need to move the kobject removal to be under that test. (Coverity spotted this) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e755f780 |
|
30-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: fix null pointer dereference in clone_fs_devices when name is null when one of the device path is missing btrfs_device name is null. So this patch will check for that. stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff812e18c0>] strlen+0x0/0x30 [<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs] [<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs] [<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0 [<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs] [<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0 [<ffffffff81192a61>] ? __blkdev_put+0x171/0x180 [<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590 [<ffffffff81193426>] ? blkdev_put+0x106/0x110 [<ffffffff81179175>] ? mntput+0x35/0x40 [<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0 [<ffffffff8115c72e>] ? ____fput+0xe/0x10 [<ffffffff81068033>] ? task_work_run+0xb3/0xd0 [<ffffffff8116d547>] SyS_ioctl+0x57/0x90 [<ffffffff817a793e>] ? do_page_fault+0xe/0x10 [<ffffffff817abe52>] system_call_fastpath+0x16/0x1b reproducer: mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2 btrfstune -S 1 /dev/sdg1 modprobe -r btrfs && modprobe btrfs mount -o degraded /dev/sdg1 /btrfs btrfs dev add /dev/sdg3 /btrfs Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b2373f25 |
|
02-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: create sprout should rename fsid on the sysfs as well Creating sprout will change the fsid of the mounted root. do the same on the sysfs as well. reproducer: mount /dev/sdb /btrfs (seed disk) btrfs dev add /dev/sdc /btrfs mount -o rw,remount /btrfs btrfs dev del /dev/sdb /btrfs mount /dev/sdb /btrfs Error: kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0d39376a |
|
02-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: dev add should add its sysfs entry we would need the device links to be created, when device is added. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
99994cde |
|
02-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: dev delete should remove sysfs entry when we delete the device from the mounted btrfs, we would need its corresponding sysfs enty to be removed as well. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8408c716 |
|
18-Jun-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong error handle when the device is missing or is not writeable The original bio might be submitted, so we shoud increase bi_remaining to account for it when we deal with the error that the device is missing or is not writeable, or we would skip the endio handle. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c55f1396 |
|
18-Jun-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix deadlock when mounting a degraded fs The deadlock happened when we mount degraded filesystem, the reproduced steps are following: # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1> # echo 1 > /sys/block/`basename <dev0>`/device/delete # mount -o degraded <dev1> <mnt> The reason was that the counter -- bi_remaining was wrong. If the missing or unwriteable device was the last device in the mapping array, we would not submit the original bio, so we shouldn't increase bi_remaining of it in btrfs_end_bio(), or we would skip the final endio handle. Fix this problem by adding a flag into btrfs bio structure. If we submit the original bio, we will set the flag, and we increase bi_remaining counter, or we don't. Though there is another way to fix it -- decrease bi_remaining counter of the original bio when we make sure the original bio is not submitted, this method need add more check and is easy to make mistake. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e990f167 |
|
18-Jun-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use bio_endio_nodec instead of open code Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
298a8f9c |
|
18-Jun-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: fix NULL pointer crash when running balance and scrub concurrently While running balance, scrub, fsstress concurrently we hit the following kernel crash: [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132 [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0 [56561.524325] Oops: 0000 [#1] SMP [....] [56561.527237] Call Trace: [56561.527309] [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs] [56561.527392] [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0 [56561.527476] [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs] [56561.527561] [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs] [56561.527639] [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0 [56561.527712] [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60 [56561.527788] [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0 [56561.527870] [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0 [56561.527941] [<ffffffff815707f7>] tracesys+0xdd/0xe2 [...] [56561.528304] RIP [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.528395] RSP <ffff88004c0f5be8> [56561.528454] CR2: 0000000000000078 This is because in btrfs_relocate_chunk(), we will free @bdev directly while scrub may still hold extent mapping, and may access freed memory. Fix this problem by wrapping freeing @bdev work into free_extent_map() which is based on reference count. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
8321cf25 |
|
22-May-2014 |
Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> |
fs: btrfs: volumes.c: Fix for possible null pointer dereference There is otherwise a risk of a possible null pointer dereference. Was largely found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Chris Mason <clm@fb.com>
|
#
351fd353 |
|
15-May-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove stale newlines from log messages I've noticed an extra line after "use no compression", but search revealed much more in messages of more critical levels and rare errors. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
29865841 |
|
13-May-2014 |
Wang Shilong <wangsl.fnst@cn.fujitsu.com> |
Btrfs: set right total device count for seeding support Seeding device support allows us to create a new filesystem based on existed filesystem. However newly created filesystem's @total_devices should include seed devices. This patch fix the following problem: # mkfs.btrfs -f /dev/sdb # btrfstune -S 1 /dev/sdb # mount /dev/sdb /mnt # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1 # umount /mnt # mount /dev/sdc /mnt --->fs_devices->total_devices = 2 This is because we record right @total_devices in superblock, but @fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout(). Fix this problem by not resetting @fs_devices->total_devices. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
29cc83f6 |
|
11-May-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix NULL pointer crash of deleting a seed device Same as normal devices, seed devices should be initialized with fs_info->dev_root as well, otherwise we'll get a NULL pointer crash. Cc: Chris Murphy <lists@colorremedies.com> Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4d90d28b |
|
05-Apr-2014 |
Anand Jain <anand.jain@oracle.com> |
btrfs: btrfs_rm_device() should zero mirror SB as well This fix will ensure all SB copies on the disk is zeroed when the disk is intentionally removed. This helps to better manage disks in the user land. This version of patch also merges the Zach patch as below. btrfs: don't double brelse on device rm Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
23f8f9b7 |
|
21-Apr-2014 |
Gui Hecheng <guihc.fnst@cn.fujitsu.com> |
btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space For RAID0,5,6,10, For system chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE For data/meta chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds a leaf. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5f43f86e |
|
21-Apr-2014 |
Gui Hecheng <guihc.fnst@cn.fujitsu.com> |
btrfs: fix wrong max system array size check in kernel space For system chunk array, We copy a "disk_key" and an chunk item each time, so there should be enough space to hold both of them, not only the chunk item. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5a1972bd |
|
16-Apr-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Add ctime/mtime update for btrfs device add/remove. Btrfs will send uevent to udev inform the device change, but ctime/mtime for the block device inode is not udpated, which cause libblkid used by btrfs-progs unable to detect device change and use old cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an error message. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Cc: Karel Zak <kzak@redhat.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7d824b6f |
|
07-May-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: balance filter: add limit of processed chunks This started as debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone. In my case the usage filter was not finegrained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow to work with a specific chunk or even with a chunk on a given device. The limit filter applies last, the value of 0 means no limiting. CC: Ilya Dryomov <idryomov@gmail.com> CC: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
60999ca4 |
|
26-Mar-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: make device scan less noisy Print the message only when the device is seen for the first time. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d458b054 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
a8c93d4e |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->submit_workers with btrfs_workqueue. Much like the fs_info->workers, replace the fs_info->submit_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
f5961d41 |
|
27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the unused struct async_sched. The struct async_sched is not used by any codes and can be removed. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fusionio.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
c404e0dc |
|
30-Jan-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix use-after-free in the finishing procedure of the device replace During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
efe120a0 |
|
20-Dec-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c7b22bb1 |
|
08-Jan-2014 |
Muthu Kumar <muthu.lkml@gmail.com> |
btrfs: fix missing increment of bi_remaining In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not, we restore the orig_bio but failed to increment bi_remaining for orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix is to increment bi_remaining when we restore the orig bio as well. Reported-by: Fengguang Wu <fengguang.wu@intel.com> CC: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Muthukumar Ratty <muthur@gmail.com> Reviewed-by: Chris Mason <clm@fb.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
bc1e79ac |
|
03-Dec-2013 |
Kent Overstreet <kmo@daterainc.com> |
block: fixup for generic bio chaining btrfs bits got lost in the rebase Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
4f024f37 |
|
11-Oct-2013 |
Kent Overstreet <kmo@daterainc.com> |
block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
|
#
475bf36f |
|
18-Nov-2013 |
Akinobu Mita <akinobu.mita@gmail.com> |
btrfs: fix bio_size_ok() for max_sectors > 0xffff The data type of max_sectors in queue settings is unsigned int. But this value is stored to the local variable whose type is unsigned short in bio_size_ok(). This can cause unexpected result when max_sectors > 0xffff. Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d9b0d9ba |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Replace kmalloc with kmalloc_array Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making it easier to check is that the calculation doesn't wrap or return a smaller allocation Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
fae7f21c |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Use WARN_ON()'s return value in place of WARN_ON(1) Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source code that outputs a more descriptive warnings. Also fix the styling warning of redundant braces that came up as a result of this fix. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
8b558c5f |
|
16-Oct-2013 |
Zach Brown <zab@redhat.com> |
btrfs: remove fs/btrfs/compat.h fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink() and inc_nlink(). This doesn't belong in mainline. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
27087f37 |
|
11-Oct-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: init device stats for new devices Device stats are only initialized (read from tree items) on mount. Trying to read device stats after adding or replacing new devices will return errors. btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two functions that allocate and initialize new btrfs_device structures after a filesystem is mounted. They set the device stats to zero by using kzalloc() which is correct for new devices. The only missing thing was to declare these stats as being valid (device->dev_stats_valid = 1) and this patch adds this missing code. This is the reproducer: TEST_DEV1=/dev/sdzzzzz1 TEST_DEV2=/dev/sdzzzzz2 TEST_DEV3=/dev/sdzzzzz3 TEST_MNT=/mnt mkfs.btrfs $TEST_DEV1 mount $TEST_DEV1 $TEST_MNT btrfs device add $TEST_DEV2 $TEST_MNT btrfs device stat $TEST_MNT btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT btrfs device stat $TEST_MNT umount $TEST_MNT Reported-by: Ondrej Kunc <kunc88@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e649e587 |
|
10-Oct-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: disallow 'btrfs {balance,replace} cancel' on ro mounts For both balance and replace, cancelling involves changing the on-disk state and committing a transaction, which is not a good thing to do on read-only filesystems. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f747cab7 |
|
10-Oct-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: nuke a bogus rw_devices decrement in __btrfs_close_devices On mount failures, __btrfs_close_devices can be called well before dev-replace state is read and ->is_tgtdev_for_dev_replace is set. This leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in __btrfs_close_devices if replace target device happens to be on the lists and we fail early in the mount sequence. Fix this by checking the devid instead of ->is_tgtdev_for_dev_replace before the decrement: for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6174d3cb |
|
01-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: remove unused max_key arg from btrfs_search_forward It is not used for anything. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7d3d1744 |
|
28-Sep-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix memory leak of chunks' extent map As we're hold a ref on looking up the extent map, we need to drop the ref before returning to callers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1357272f |
|
02-Oct-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishing free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev, can be processed before btrfs_scratch_superblock is called, which would result in a use-after-free on btrfs_device contents. Fix this by zeroing the superblock before the rcu callback is registered. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5138cccf |
|
13-Sep-2013 |
Frank Holton <fholton@gmail.com> |
btrfs: Add btrfs: prefix to kernel log output The kernel log entries for device label %s and device fsid %pU are missing the btrfs: prefix. Add those here. Signed-off-by: Frank Holton <fholton@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
55e50e45 |
|
01-Sep-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: do not add replace target to the alloc_list If replace was suspended by the umount, replace target device is added to the fs_devices->alloc_list during a later mount. This is obviously wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that, but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized *after* everything is opened and fs_devices lists are populated. Fix this by checking the devid instead: for replace targets it's always equal to BTRFS_DEV_REPLACE_DEVID. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f45388f3 |
|
28-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix deadlock in uuid scan kthread If there's an ongoing transaction when the uuid scan kthread attempts to create one, the kthread will block, waiting for that transaction to finish while it's keeping locks on the tree root, and in turn the existing transaction is waiting for those locks to be free. The stack trace reported by the kernel follows. [36700.671601] INFO: task btrfs-uuid:15480 blocked for more than 120 seconds. [36700.671602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671602] btrfs-uuid D 0000000000000000 0 15480 2 0x00000000 [36700.671604] ffff880710bd5b88 0000000000000046 ffff8803d36ba850 0000000000030000 [36700.671605] ffff8806d76dc530 ffff880710bd5fd8 ffff880710bd5fd8 ffff880710bd5fd8 [36700.671607] ffff8808098ac530 ffff8806d76dc530 ffff880710bd5b98 ffff8805e4508e40 [36700.671608] Call Trace: [36700.671610] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671620] [<ffffffffa05a3bdf>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [36700.671623] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671629] [<ffffffffa05a5b06>] start_transaction+0x3d6/0x530 [btrfs] [36700.671636] [<ffffffffa05bb1f4>] ? btrfs_get_token_32+0x64/0xf0 [btrfs] [36700.671642] [<ffffffffa05a5fbb>] btrfs_start_transaction+0x1b/0x20 [btrfs] [36700.671649] [<ffffffffa05c8a81>] btrfs_uuid_scan_kthread+0x211/0x3d0 [btrfs] [36700.671655] [<ffffffffa05c8870>] ? __btrfs_open_devices+0x2a0/0x2a0 [btrfs] [36700.671657] [<ffffffff81065fa0>] kthread+0xc0/0xd0 [36700.671659] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671661] [<ffffffff816fcd1c>] ret_from_fork+0x7c/0xb0 [36700.671662] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671663] INFO: task btrfs:15481 blocked for more than 120 seconds. [36700.671664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671665] btrfs D 0000000000000000 0 15481 15212 0x00000004 [36700.671666] ffff880248cbf4c8 0000000000000086 ffff8803d36ba700 ffff8801dbd5c280 [36700.671668] ffff880807815c40 ffff880248cbffd8 ffff880248cbffd8 ffff880248cbffd8 [36700.671669] ffff8805e86a0000 ffff880807815c40 ffff880248cbf4d8 ffff8801dbd5c280 [36700.671670] Call Trace: [36700.671672] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671679] [<ffffffffa05d9b0d>] btrfs_tree_lock+0x6d/0x230 [btrfs] [36700.671680] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671685] [<ffffffffa0582829>] btrfs_search_slot+0x999/0xb00 [btrfs] [36700.671691] [<ffffffffa05bd9de>] ? btrfs_lookup_first_ordered_extent+0x5e/0xb0 [btrfs] [36700.671698] [<ffffffffa05e3e54>] __btrfs_write_out_cache+0x8c4/0xa80 [btrfs] [36700.671704] [<ffffffffa05e4362>] btrfs_write_out_cache+0xb2/0xf0 [btrfs] [36700.671710] [<ffffffffa05c4441>] ? free_extent_buffer+0x61/0xc0 [btrfs] [36700.671716] [<ffffffffa0594c82>] btrfs_write_dirty_block_groups+0x562/0x650 [btrfs] [36700.671723] [<ffffffffa0610092>] commit_cowonly_roots+0x171/0x24b [btrfs] [36700.671729] [<ffffffffa05a4dde>] btrfs_commit_transaction+0x4fe/0xa10 [btrfs] [36700.671735] [<ffffffffa0610af3>] create_subvol+0x5c0/0x636 [btrfs] [36700.671742] [<ffffffffa05d49ff>] btrfs_mksubvol.isra.60+0x33f/0x3f0 [btrfs] [36700.671747] [<ffffffffa05d4bf2>] btrfs_ioctl_snap_create_transid+0x142/0x190 [btrfs] [36700.671752] [<ffffffffa05d4c6c>] ? btrfs_ioctl_snap_create+0x2c/0x80 [btrfs] [36700.671757] [<ffffffffa05d4c9e>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [36700.671759] [<ffffffff8113a764>] ? handle_pte_fault+0x84/0x920 [36700.671764] [<ffffffffa05d87eb>] btrfs_ioctl+0xf0b/0x1d00 [btrfs] [36700.671766] [<ffffffff8113c120>] ? handle_mm_fault+0x210/0x310 [36700.671768] [<ffffffff816f83a4>] ? __do_page_fault+0x284/0x4e0 [36700.671770] [<ffffffff81180aa6>] do_vfs_ioctl+0x96/0x550 [36700.671772] [<ffffffff81170fe3>] ? __sb_end_write+0x33/0x70 [36700.671774] [<ffffffff81180ff1>] SyS_ioctl+0x91/0xb0 [36700.671775] [<ffffffff816fcdc2>] system_call_fastpath+0x16/0x1b Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
795a3321 |
|
27-Aug-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: stop refusing the relocation of chunk 0 AFAICT chunk 0 is no longer special, and so it should be restriped just like every other chunk. One reason for this change is us refusing the relocation can lead to filesystems that can only be mounted ro, and never rw -- see the bugzilla [1] for details. The other reason is that device removal code is already doing this: it will happily relocate chunk 0 is part of shrinking the device. [1] https://bugzilla.kernel.org/show_bug.cgi?id=60594 Reported-by: Xavier Bassery <xavier@bartica.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
726551eb |
|
26-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: adjust the fs_devices->missing count on unmount I noticed that if I tried to mount a file system with -o degraded after having done it once already we would fail to mount. This is because the fs_devices->missing count was getting bumped everytime we mounted, but not getting reset whenever we unmounted. To fix this we just drop the missing count as we're closing devices to make sure this doesn't happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f7171750 |
|
12-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl The handler for the ioctl BTRFS_IOC_FS_INFO was reading the number of devices before acquiring the device list mutex. This could lead to inconsistent results because the update of the device list and the number of devices counter (amongst other counters related to the device list) are updated in volumes.c while holding the device list mutex - except for 2 places, one was volumes.c:btrfs_prepare_sprout() and the other was volumes.c:device_list_add(). For example, if we have 2 devices, with IDs 1 and 2 and then add a new device, with ID 3, and while adding the device is in progress an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of devices of 2 and a max dev id of 3. This would be incorrect. Also, this ioctl handler was reading the fsid while it can be updated concurrently. This can happen when while a new device is being added and the current filesystem is in seeding mode. Example: $ mkfs.btrfs -f /dev/sdb1 $ mkfs.btrfs -f /dev/sdb2 $ btrfstune -S 1 /dev/sdb1 $ mount /dev/sdb1 /mnt/test $ btrfs device add /dev/sdb2 /mnt/test If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it could read an fsid that was never valid (some bits part of the old fsid and others part of the new fsid). Also, it could read a number of devices that doesn't match the number of devices in the list and the max device id, as explained before. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d7306801 |
|
09-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix race between removing a dev and writing sbs This change fixes an issue when removing a device and writing all super blocks run simultaneously. Here's the steps necessary for the issue to happen: 1) disk-io.c:write_all_supers() gets a number of N devices from the super_copy, so it will not panic if it fails to write super blocks for N - 1 devices; 2) Then it tries to acquire the device_list_mutex, but blocks because volumes.c:btrfs_rm_device() got it first; 3) btrfs_rm_device() removes the device from the list, then unlocks the mutex and after the unlock it updates the number of devices in super_copy to N - 1. 4) write_all_supers() finally acquires the mutex, iterates over all the devices in the list and gets N - 1 errors, that is, it failed to write super blocks to all the devices; 5) Because write_all_supers() thinks there are a total of N devices, it considers N - 1 errors to be ok, and therefore won't panic. So this change just makes sure that write_all_supers() reads the number of devices from super_copy after it acquires the device_list_mutex. Conversely, it changes btrfs_rm_device() to update the number of devices in super_copy before it releases the device list mutex. The code path to add a new device (volumes.c:btrfs_init_new_device), already has the right behaviour: it updates the number of devices in super_copy while holding the device_list_mutex. The only code path that doesn't lock the device list mutex before updating the number of devices in the super copy is disk-io.c:next_root_backup(), called by open_ctree() during mount time where concurrency issues can't happen. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
231e88f4 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1473b24e |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_device_fsid() return unsigned long All callers of btrfs_device_fsid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
410ba3a2 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_device_uuid() return unsigned long All callers of btrfs_device_uuid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c1c9ff7c |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Remove superfluous casts from u64 to unsigned long long u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a1e8780a |
|
12-Aug-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: rollback btrfs_device fields on umount It turns out we don't properly rollback in-core btrfs_device state on umount. We zero out ->bdev, ->in_fs_metadata and that's about it. In particular, we don't zero out ->generation, and this can lead to us refusing a mount -- a non-NULL fs_devices->latest_bdev is essential, but btrfs_close_extra_devices will happily assign NULL to ->latest_bdev if the first device on the dev_list happens to be missing and consequently has no bdev attached. This happens because since commit a6b0d5c8 btrfs_close_extra_devices adjusts ->latest_bdev, and in doing that, relies on the ->generation. Fix this, and possibly other problems, by zeroing out everything except for what device_list_add sets, so that a mount right after insmod and 'btrfs dev scan' is no different from any later mount in this respect. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
2208a378 |
|
12-Aug-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add alloc_fs_devices and switch to it In the spirit of btrfs_alloc_device, add a helper for allocating and doing some common initialization of btrfs_fs_devices struct. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
12bd2fc0 |
|
23-Aug-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add btrfs_alloc_device and switch to it Currently btrfs_device is allocated ad-hoc in a few different places, and as a result not all fields are initialized properly. In particular, readahead state is only initialized in device_list_add (at scan time), and not in btrfs_init_new_device (when the new device is added with 'btrfs dev add'). Fix this by adding an allocation helper and switch everybody but __btrfs_close_devices to it. (__btrfs_close_devices is dealt with in a later commit.) Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
53f10659 |
|
12-Aug-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: find_next_devid: root -> fs_info find_next_devid() knows which root to search, so it should take an fs_info instead of an arbitrary root. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
70f80175 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: check UUID tree during mount if required If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things: 1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed). 2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there). This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
803b2f54 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fill UUID tree initially When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f7a81ea4 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: create UUID tree if required This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
35a3621b |
|
14-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: get rid of sparse warnings make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__ I tried to filter out the warnings for which patches have already been sent to the mailing list, pending for inclusion in btrfs-next. All these changes should be obviously safe. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
eb2067f7 |
|
30-Jul-2013 |
Dave Jones <davej@redhat.com> |
Fix leak in __btrfs_map_block error path If we bail out when the stripe alloc fails, we need to undo the earlier allocation of raid_map. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
395927a9 |
|
29-Jul-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: optimize function btrfs_read_chunk_tree After reading all device items from the chunk tree, don't exit the loop and then navigate down the tree again to find the chunk items. Instead just read all device items and chunk items with a single tree search. This is possible because all device items are found before any chunk item in the chunks tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3cae210f |
|
15-Jul-2013 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert Some codes still use the cpu_to_lexx instead of the BTRFS_SETGET_STACK_FUNCS declared in ctree.h. Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec and other structures. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ccf39f92 |
|
11-Jul-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: make errors in btrfs_num_copies less noisy The log message level 'critical' is verbose enough, 'emergency' beeps on all terminals. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d7901554 |
|
04-Mar-2013 |
Carey Underwood <carey@scaletech.ca> |
Btrfs: Release uuid_mutex for shrink during device delete Device scanning waits on the uuid_mutex, which can result in a very long wait if dev delete is shrinking the device. Signed-off-by: Carey Underwood <cwillu@cwillu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
cd633972 |
|
15-Jul-2013 |
Sachin Kamat <sachin.kamat@linaro.org> |
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
#
6df9a95e |
|
27-Jun-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: make the chunk allocator completely tree lockless When adjusting the enospc rules for relocation I ran into a deadlock because we were relocating the only system chunk and that forced us to try and allocate a new system chunk while holding locks in the chunk tree, which caused us to deadlock. To fix this I've moved all of the dev extent addition and chunk addition out to the delayed chunk completion stuff. We still keep the in-memory stuff which makes sure everything is consistent. One change I had to make was to search the commit root of the device tree to find a free dev extent, and hold onto any chunk em's that we allocated in that transaction so we do not allocate the same dev extent twice. This has the side effect of fixing a bug with balance that has been there ever since balance existed. Basically you can free a block group and it's dev extent and then immediately allocate that dev extent for a new block group and write stuff to that dev extent, all within the same transaction. So if you happen to crash during a balance you could come back to a completely broken file system. This patch should keep these sort of things from happening in the future since we won't be able to allocate free'd dev extents until after the transaction commits. This has passed all of the xfstests and my super annoying stress test followed by a balance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
a70c6172 |
|
19-Jun-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong mirror number tuning Now reading the data from the target device of the replace operation is allowed, so the mirror number that is greater than the stripes number of a chunk is valid, we will tune it when we find there is no target device later. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
97a184fe |
|
01-Jun-2013 |
Thomas Meyer <thomas@m3y3r.de> |
Btrfs: Cocci spatch "ptr_ret.spatch" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
183860f6 |
|
17-May-2013 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: device delete to get errors from the kernel when user runs command btrfs dev del the raid requisite error if any goes to the /var/log/messages, its not good idea to clutter messages with these user (knowledge) errors, further user don't have to review the system messages to know problem with the cli it should be dropped to the user as part of the cli return. to bring this feature created a set of the ERROR defined BTRFS_ERROR_DEV* error codes and created their error string. I expect this enum to be added with other error which we might want to communicate to the user land v3: moved the code with in the file no logical change v1->v2: introduce error codes for the device mgmt usage v1: adds a parameter in the ioctl arg struct to carry the error string Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
cb517eab |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: cleanup the similar code of the fs root read There are several functions whose code is similar, such as btrfs_find_last_root() btrfs_read_fs_root_no_radix() Besides that, some functions are invoked twice, it is unnecessary, for example, we are sure that all roots which is found in btrfs_find_orphan_roots() have their orphan items, so it is unnecessary to check the orphan item again. So cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
9be3395b |
|
17-May-2013 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: use a btrfs bioset instead of abusing bio internals Btrfs has been pointer tagging bi_private and using bi_bdev to store the stripe index and mirror number of failed IOs. As bios bubble back up through the call chain, we use these to decide if and how to retry our IOs. They are also used to count IO failures on a per device basis. Recently a bio tracepoint was added lead to crashes because we were abusing bi_bdev. This commit adds a btrfs bioset, and creates explicit fields for the mirror number and stripe index. The plan is to extend this structure for all of the fields currently in struct btrfs_bio, which will mean one less kmalloc in our IO path. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Tejun Heo <tj@kernel.org>
|
#
8250dabe |
|
11-May-2013 |
Andreas Philipp <philipp.andreas@gmail.com> |
Correct allowed raid levels on balance. Raid5 with 3 devices is well defined while the old logic allowed raid5 only with a minimum of 4 devices when converting the block group profile via btrfs balance. Creating a raid5 with just three devices using mkfs.btrfs worked always as expected. This is now fixed and the whole logic is rewritten. Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
48a3b636 |
|
25-Apr-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: make static code static & remove dead code Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
fb7669b5 |
|
23-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: don't BUG_ON() in btrfs_num_copies A user sent me a btrfs-image that was panicing because of some corruption. This is because we pass in a bogus value to btrfs_num_copies, and it panics. Instead just return 1. We only call btrfs_num_copies to see if there are other copies to try and read for things, so if we just return 1 it will make the callers exit out with an appropriate error value. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
9bb91873 |
|
19-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: deal with bad mappings in btrfs_map_block Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find a chunk for a particular block we were trying to map. This happened because the block was bogus. We shouldn't be BUG_ON()'ing in this case, just print a message and return an error. This came from reada_add_block and it appears to deal with an error fine so we should be good there. Thanks, Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
ceda0864 |
|
11-Apr-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use a lock to protect incompat/compat flag of the super block The following case will make the incompat/compat flag of the super block be recovered. Task1 |Task2 flags = btrfs_super_incompat_flags(); | |flags = btrfs_super_incompat_flags(); flags |= new_flag1; | |flags |= new_flag2; btrfs_set_super_incompat_flags(flags); | |btrfs_set_super_incompat_flags(flags); the new_flag1 is recovered. In order to avoid this problem, we introduce a lock named super_lock into the btrfs_fs_info structure. If we want to update incompat/compat flags of the super block, we must hold it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
f63e0cca |
|
04-Apr-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: ignore device open failures in __btrfs_open_devices This: # mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test would lead to a blkdev open/close mismatch when the mount fails, and a permanently busy (opened O_EXCL) sdb2: # wipefs -a /dev/sdb2 wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy It's because btrfs_open_devices() may open some devices, fail on the last one, and return that failure stored in "ret." The mount then fails, but the caller then does not clean up the open devices. Chris assures me that: "btrfs_open_devices just means: go off and open every bdev you can from this uuid. It should return success if we opened any of them at all." So change the logic to ignore any open failures; just skip processing of that device. Later on it's decided whether we have enough devices to continue. Reported-by: Jan Safranek <jsafrane@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
09a2a8f9 |
|
05-Apr-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix bad extent logging A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
c2cf52eb |
|
19-Mar-2013 |
Simon Kirby <sim@hostway.ca> |
Btrfs: Include the device in most error printk()s With more than one btrfs volume mounted, it can be very difficult to find out which volume is hitting an error. btrfs_error() will print this, but it is currently rigged as more of a fatal error handler, while many of the printk()s are currently for debugging and yet-unhandled cases. This patch just changes the functions where the device information is already available. Some cases remain where the root or fs_info is not passed to the function emitting the error. This may introduce some confusion with volumes backed by multiple devices emitting errors referring to the primary device in the set instead of the one on which the error occurred. Use btrfs_printk(fs_info, format, ...) rather than writing the device string every time, and introduce macro wrappers ala XFS for brevity. Since the function already cannot be used for continuations, print a newline as part of the btrfs_printk() message rather than at each caller. Signed-off-by: Simon Kirby <sim@hostway.ca> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
aa8b57aa |
|
05-Feb-2013 |
Kent Overstreet <koverstreet@google.com> |
block: Use bio_sectors() more consistently Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>
|
#
835d974f |
|
18-Mar-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: handle a bogus chunk tree nicely If you restore a btrfs-image file system and try to mount that file system we'll panic. That's because btrfs-image restores and just makes one big chunk to envelope the whole disk, since they are really only meant to be messed with by our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so that we no longer panic but instead just return an error and fail to mount. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
bc178622 |
|
09-Mar-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: use rcu_barrier() to wait for bdev puts at unmount Doing this would reliably fail with -EBUSY for me: # mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2 ... unable to open /dev/sdb2: Device or resource busy because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it. Using systemtap to track bdev gets & puts shows a kworker thread doing a blkdev put after mkfs attempts a get; this is left over from the unmount path: btrfs_close_devices __btrfs_close_devices call_rcu(&device->rcu, free_device); free_device INIT_WORK(&device->rcu_work, __free_device); schedule_work(&device->rcu_work); so unmount might complete before __free_device fires & does its blkdev_put. Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait until all blkdev_put()s are done, and the device is truly free once unmount completes. Cc: stable@vger.kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3a01aa7a |
|
06-Mar-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix a mismerge in btrfs_balance() Raid56 merge (merge commit e942f88) had mistakenly removed a call to __cancel_balance(), which resulted in balance not cleaning up after itself after a successful finish. (Cleanup includes switching the state, removing the balance item and releasing mut_ex_op testnset lock.) Bring it back. Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0f788c58 |
|
04-Mar-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: do not BUG_ON on aborted situation Btrfs balance can easily hit BUG_ON in these places, but we want to it bail out gracefully after we force the whole filesystem to readonly. So we use btrfs_std_error hook in place of BUG_ON. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2d8946c5 |
|
27-Feb-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: remove a printk from scan_one_device Dave pointed out that he saw messages from btrfs although there was no such filesystem on his computers. The automatic device scan is called on every new blockdevice if the usual distro udev rule set is used. The printk introduced in 6f60cbd3ae442c was a remainder from copying portions of code from btrfs_get_bdev_and_sb which is used under different conditions and the warning makes sense there. Reported-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
fda2832f |
|
26-Feb-2013 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: cleanup for open-coded alignment Though most of the btrfs codes are using ALIGN macro for page alignment, there are still some codes using open-coded alignment like the following: ------ u64 mask = ((u64)root->stripesize - 1); u64 ret = (val + mask) & ~mask; ------ Or even hidden one: ------ num_bytes = (end - start + blocksize) & ~(blocksize - 1); ------ Sometimes these open-coded alignment is not so easy to understand for newbie like me. This commit changes the open-coded alignment to the ALIGN macro for a better readability. Also there is a previous patch from David Sterba with similar changes, but the patch is for 3.2 kernel and seems not merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
86db2578 |
|
20-Feb-2013 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: fix max chunk size on raid5/6 We try to limit the size of a chunk to 10GB, which keeps the unit of work reasonable during balance and resize operations. The limit checks were taking into account the number of copies of the data we had but what they really should be doing is comparing against the logical size of the chunk we're creating. This moves the code around a little to use the count of data stripes from raid5/6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1cba0cdf |
|
20-Feb-2013 |
Thomas Gleixner <tglx@linutronix.de> |
btrfs: Init io_lock after cloning btrfs device struct __btrfs_close_devices() clones btrfs device structs with memcpy(). Some of the fields in the clone are reinitialized, but it's missing to init io_lock. In mainline this goes unnoticed, but on RT it leaves the plist pointing to the original about to be freed lock struct. Initialize io_lock after cloning, so no references to the original struct are left. Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
cdb4c574 |
|
19-Feb-2013 |
Zach Brown <zab@redhat.com> |
btrfs: define BTRFS_MAGIC as a u64 value super.magic is an le64 but it's treated as an unterminated string when compared against BTRFS_MAGIC which is defined as a string. Instead define BTRFS_MAGIC as a normal hex value and use endian helpers to compare it to the super's magic. I tested this by mounting an fs made before the change and made sure that it didn't introduce sparse errors. This matches a similar cleanup that is pending in btrfs-progs. David Sterba pointed out that we should fix the kernel side as well :). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
3e39cea6 |
|
12-Feb-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for selecting only completely empty chunks Enhance balance usage filter by making it possible to balance out only completely empty chunks. Today, usage filter properly acts on values from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0 selects no chunks. This commit changes the usage=0 case: the new meaning is to restripe only completely empty chunks and nothing else. Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
bf023ecf |
|
12-Feb-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: eliminate a use-after-free in btrfs_balance() Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed above in __cancel_balance() in all cases except for balance pause. Fix this by moving the offending check a couple statements above, the meaning of the check is preserved. Reported-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
063d006f |
|
30-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk WARN_ON isn't enough, we need to stop the loop if for any reason we would overrun the devices_info array. I tried to track down the connection between the length of the alloc_devices list and the rw_devices counter but it wasn't immediately obvious, so be defensive about it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
0f5d42b2 |
|
31-Jan-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: remove extent mapping if we fail to add chunk I got a double free error when unmounting a file system that failed to add a chunk during its operation. This is because we will kfree the mapping that we created but leave the extent_map in the em_tree for chunks. So to fix this just remove the extent_map when we error out so we don't run into this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
04487488 |
|
29-Jan-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix chunk allocation error handling If we error out allocating a dev extent we will have already created the block group and such which will cause problems since the allocator may have tried to allocate out of the block group that no longer exists. This will cause BUG_ON()'s in the bio submission path. This also makes a failure to allocate a dev extent a non-abort error, we will just clean up the dev extents we did allocate and exit. Now if we fail to delete the dev extents we will abort since we can't have half of the dev extents hanging around, but this will make us much less likely to abort. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
de98ced9 |
|
29-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e6ec716f |
|
16-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: make raid attr array more readable The current code of raid attr arry is hard to understand and it is easy to introduce some problem if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
6f60cbd3 |
|
15-Feb-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: access superblock via pagecache in scan_one_device btrfs_scan_one_device is calling set_blocksize() which can race with a concurrent process making dirty page cache pages. It can end up dropping dirty page cache pages on the floor, which isn't very nice when someone is just running btrfs dev scan to find filesystems on the box. Now that udev is registering btrfs devices as it discovers them, we can actually end up racing with our own mkfs program too. When this happens, we drop some of the important blocks written by mkfs. This commit changes scan_one_device to read the super out of the page cache instead of trying to use bread. This way we don't have to care about the blocksize of the device. This also drops the invalidate_bdev() call. It wasn't very polite to invalidate during the scan either. mkfs is putting the super into the page cache, there's no reason to invalidate at this point. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1f0905ec |
|
05-Feb-2013 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: remove conflicting check for minimum number of devices in raid56 The device removal code was incorrectly checking against two different limits for raid5 and raid6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
53b381b3 |
|
29-Jan-2013 |
David Woodhouse <David.Woodhouse@intel.com> |
Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3c911608 |
|
30-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: don't try to notify udev about missing devices If we remove a missing device, bdev is null, and if we send that off to btrfs_kobject_uevent we'll panic. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c9f01bfe |
|
16-Jan-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong max device number for single profile The max device number of single profile is 1, not 0 (0 means 'as many as possible'). Fix it. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
a105bb88 |
|
21-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix a regression in balance usage filter Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which was merged into 3.8-rc1, has introduced a regression by removing logic that was guarding us against bad user input. Bring it back. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ed0fb78f |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: bring back balance pause/resume logic Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1 as part of dev-replace merge). Offending commit took a stab at making mutually exclusive volume operations (add_dev, rm_dev, resize, balance, replace_dev) not block behind volume_mutex if another such operation is in progress and instead return an error right away. Balancing front-end relied on the blocking behaviour, so the fix is ugly, but short of a complete rework, it's the best we can do. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
cc975eb4 |
|
07-Dec-2012 |
Lukas Czerner <lczerner@redhat.com> |
btrfs: get the device in write mode when deleting it When we're deleting the device we should get it in write mode since we're going to re-write the super block magic on that device. And it should fail if the device is read-only. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
#
31e50229 |
|
21-Nov-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: put raid properties into global table Raid properties can be shared among raid calculation code, we can put them into a global table to keep it simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
70c8a91c |
|
11-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: log changed inodes based on the extent map tree We don't really need to copy extents from the source tree since we have all of the information already available to us in the extent_map tree. So instead just write the extents straight to the log tree and don't bother to copy the extent items from the source tree. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b8b8ff59 |
|
06-Dec-2012 |
Lukas Czerner <lczerner@redhat.com> |
btrfs: Notify udev when removing device Currently udev does not know about the device being removed from the file system. This may result in the situation where we're unable to mount the file system by UUID or by LABEL because the by-uuid and by-label links may still point to the device which is no longer part of the btrfs file system and hence does not have any btrfs super block. It can be easily reproduced by the following: mkfs.btrfs -L bugfs /dev/loop[0-6] mount /dev/loop0 /mnt/test btrfs device delete /dev/loop0 /mnt/test umount /mnt/test mount LABEL=bugfs /mnt/test <---- this fails then see: ls -l /dev/disk/by-label/bugfs which will still point to the /dev/loop0 We did not noticed this before because libblkid would send the udev event for us when it notice that the link does not fit the reality, however it does not do that anymore and completely relies on udev information. Fix this by sending the KOBJ_CHANGE event to the bdev kobject after successful device removal. Note that this does not affect device addition, because we will open the device prior the addition from userspace and udev will notice that and reread the device afterwards. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f9c83748 |
|
27-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fix a build warning for an unused label This issue was detected by the "0-DAY kernel build testing". fs/btrfs/volumes.c: In function 'btrfs_rm_device': fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label] Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ad6d620e |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: allow repair code to include target disk when searching mirrors Make the target disk of a running device replace operation available for reading. This is only used as a last ressort for the defect repair procedure. And it is dependent on the location of the data block to read, because during an ongoing device replace operation, the target drive is only partially filled with the filesystem data. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
30d9861f |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: optionally avoid reads from device replace source drive It is desirable to be able to configure the device replace procedure to avoid reading the source drive (the one to be copied) whenever possible. This is useful when the number of read errors on this disk is high, because it would delay the copy procedure alot. Therefore there is an option to avoid reading from the source disk unless the repair procedure really needs to access it. The regular read req asks for mapping the block with mirror_num == 0, in this case the source disk is avoided whenever possible. The repair code selects the mirror_num explicitly (mirror_num != 0), this case is not changed by this commit. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
472262f3 |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: changes to live filesystem are also written to replacement disk During a running dev replace operation, all write requests to the live filesystem are duplicated to also write to the target drive. Therefore btrfs_map_block() is changed to duplicate stripes that are written to the source disk of a device replace procedure to be written to the target disk as well. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
29a8d9a0 |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() Before this commit, btrfs_map_block() was called with REQ_WRITE in order to retrieve the list of mirrors for a disk block. This needs to be changed for the device replace procedure since it makes a difference whether you are asking for read mirrors or for locations to write to. GET_READ_MIRRORS is introduced as a new interface to call btrfs_map_block(). In the current commit, the functionality is not yet changed, only the interface for GET_READ_MIRRORS is introduced and all the places that should use this new interface are adapted. The reason that REQ_WRITE cannot be abused anymore to retrieve a list of read mirrors is that during a running dev replace operation all write requests to the live filesystem are duplicated to also write to the target drive. Keep in mind that the target disk is only partially a valid copy of the source disk while the operation is ongoing. All writes go to the target disk, but not all reads would return valid data on the target disk. Therefore it is not possible anymore to abuse a REQ_WRITE interface to find valid mirrors for a REQ_READ. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
8dabb742 |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: change core code of btrfs to support the device replace operations This commit contains all the essential changes to the core code of Btrfs for support of the device replace procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e93c89c1 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add new sources for device replace code This adds a new file to the sources together with the header file and the changes to ioctl.h and ctree.h that are required by the new C source file. Additionally, 4 new functions are added to volume.c that deal with device creation and destruction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
61891923 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: handle errors from btrfs_map_bio() everywhere With the addition of the device replace procedure, it is possible for btrfs_map_bio(READ) to report an error. This happens when the specific mirror is requested which is located on the target disk, and the copy operation has not yet copied this block. Hence the block cannot be read and this error state is indicated by returning EIO. Some background information follows now. A new mirror is added while the device replace procedure is running. btrfs_get_num_copies() returns one more, and btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk location is involved that was already handled by the device replace copy operation. The assigned mirror num is the highest mirror number, e.g. the value 3 in case of RAID1. If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select any mirror), the copy on the target drive is never selected because that disk shall be able to perform the write requests as quickly as possible. The parallel execution of read requests would only slow down the disk copy procedure. Second case is that btrfs_map_bio() is called with mirror_num > 0. This is done from the repair code only. In this case, the highest mirror num is assigned to the target disk, since it is used last. And when this mirror is not available because the copy procedure has not yet handled this area, an error is returned. Everywhere in the code the handling of such errors is added now. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
63a212ab |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow some operations on the device replace target device This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5ac00add |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow mutually exclusive admin operations from user mode Btrfs admin operations that are manually started from user mode and that cannot be executed at the same time return -EINPROGRESS. A common way to enter and leave this locked section is introduced since it used to be specific to the balance operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
aa1b8cd4 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: pass fs_info instead of root A small number of functions that are used in a device replace procedure when the operation is resumed at mount time are unable to pass the same root pointer that would be used in the regular (ioctl) context. And since the root pointer is not required, only the fs_info is, the root pointer argument is replaced with the fs_info pointer argument. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a8a6dab7 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add btrfs_scratch_superblock() function This new function is used by the device replace procedure in a later patch. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3ec706c8 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree This is required for the device replace procedure in a later step. Two calling functions also had to be changed to have the fs_info pointer: repair_io_failure() and scrub_setup_recheck_block(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5d964051 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree This is required for the device replace procedure in a later step. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
7ba15b7d |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add two more find_device() methods The new function btrfs_find_device_missing_or_by_path() will be used for the device replace procedure. This function itself calls the second new function btrfs_find_device_by_path(). Unfortunately, it is not possible to currently make the rest of the code use these functions as well, since all functions that look similar at first view are all a little bit different in what they are doing. But in the future, new code could benefit from these two new functions, and currently, device replace uses them. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
beaf8ab3 |
|
12-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: move some common code into a subfunction Some code to open block devices, to read the superblock and to handle errors was repeated multiple times in 3 places, and the following patch makes use of it as well. This code is now moved into a subfunction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d25628bd |
|
14-Nov-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: protect devices list with its mutex Since we've kill the bigger one volume_mutex, we need to add devices list mutex back. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d03f918a |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: Don't trust the superblock label and simply printk("%s") it Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the superblock and make Btrfs printk("%s") crash while holding the uuid_mutex since nobody forces a limit on the string. Since the uuid_mutex is significant, the system would be unusable afterwards. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
31b1a2bd |
|
03-Nov-2012 |
Julia Lawall <Julia.Lawall@lip6.fr> |
fs/btrfs: use WARN Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d1423248 |
|
31-Oct-2012 |
Masanari Iida <standby24x7@gmail.com> |
Btrfs: Fix typo in fs/btrfs Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0253f40e |
|
26-Oct-2012 |
jeff.liu <jeff.liu@oracle.com> |
Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev() Remove an invalid size check up from btrfs_shrink_dev(). The new size should not larger than the device->total_bytes as it was already verified before coming to here(i.e. new_size < old_size). Remove invalid check up for btrfs_shrink_dev(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
de1ee92a |
|
19-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: recheck bio against block device when we map the bio Alex reported a problem where we were writing between chunks on a rbd device. The thing is we do bio_add_page using logical offsets, but the physical offset may be different. So when we map the bio now check to see if the bio is still ok with the physical offset, and if it is not split the bio up and redo the bio_add_page with the physical sector. This fixes the problem for Alex and doesn't affect performance in the normal case. Thanks, Reported-and-tested-by: Alex Elder <elder@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3fed40cc |
|
13-Sep-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: cleanup duplicated division functions div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
926ccfef |
|
31-Oct-2012 |
Masanari Iida <standby24x7@gmail.com> |
Btrfs: Fix printk and variable name Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
671415b7 |
|
16-Oct-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix deadlock caused by the nested chunk allocation Steps to reproduce: # mkfs.btrfs -m raid1 <disk1> <disk2> # btrfstune -S 1 <disk1> # mount <disk1> <mnt> # btrfs device add <disk3> <disk4> <mnt> # mount -o remount,rw <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 Deadlock happened. It is because of the nested chunk allocation. When we wrote the data into the filesystem, we would allocate the data chunk because there was no data chunk in the filesystem. At the end of the data chunk allocation, we should insert the metadata of the data chunk into the extent tree, but there was no raid1 chunk, so we tried to lock the chunk allocation mutex to allocate the new chunk, but we had held the mutex, the deadlock happened. By rights, we would allocate the raid1 chunk when we added the second device because the profile of the seed filesystem is raid1 and we had two devices. But we didn't do that in fact. It is because the last step of the first device insertion didn't commit the transaction. So when we added the second device, we didn't cow the tree, and just inserted the relative metadata into the leaves which were generated by the first device insertion, and its profile was dup. So, I fix this problem by commiting the transaction at the end of the first device insertion. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
5af3e8cc |
|
01-Aug-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: make filesystem read-only when submitting barrier fails So far the return code of barrier_all_devices() is ignored, which means that errors are ignored. The result can be a corrupt filesystem which is not consistent. This commit adds code to evaluate the return code of barrier_all_devices(). The normal btrfs_error() mechanism is used to switch the filesystem into read-only mode when errors are detected. In order to decide whether barrier_all_devices() should return error or success, the number of disks that are allowed to fail the barrier submission is calculated. This calculation accounts for the worst RAID level of metadata, system and data. If single, dup or RAID0 is in use, a single disk error is already considered to be fatal. Otherwise a single disk error is tolerated. The calculation of the number of disks that are tolerated to fail the barrier operation is performed when the filesystem gets mounted, when a balance operation is started and finished, and when devices are added or removed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
48940662 |
|
07-May-2012 |
Daniel J Blueman <daniel@quora.org> |
btrfs: fix message printing Fix various messages to include newline and module prefix. Signed-off-by: Daniel J Blueman <daniel@quora.org>
|
#
005d6427 |
|
18-Sep-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: move transaction aborts to the point of failure Call btrfs_abort_transaction as early as possible when an error condition is detected, that way the line number reported is useful and we're not clueless anymore which error path led to the abort. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
5ee0844d |
|
27-Aug-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: revert checksum error statistic which can cause a BUG() Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device statistic counters for detected IO and checksum errors to Linux 3.5. The statistic part that counts checksum errors in end_bio_extent_readpage() can cause a BUG() in a subfunction: "kernel BUG at fs/btrfs/volumes.c:3762!" That part is reverted with the current patch. However, the counting of checksum errors in the scrub context remains active, and the counting of detected IO errors (read, write or flush errors) in all contexts remains active. Cc: stable <stable@vger.kernel.org> # 3.5 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
66657b31 |
|
01-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: barrier before waitqueue_active We need a barrir before calling waitqueue_active otherwise we will miss wakeups. So in places that do atomic_dec(); then atomic_read() use atomic_dec_return() which imply a memory barrier (see memory-barriers.txt) and then add an explicit memory barrier everywhere else that need them. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
99f5944b |
|
02-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: do not strdup non existent strings When we close devices we add back empty devices for some reason that escapes me. In the case of a missing dev we don't allocate an rcu_string for it's name, so check to see if the device has a name and if it doesn't don't bother strdup()'ing it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
34eaadaf |
|
25-Jul-2012 |
Artem Bityutskiy <artem.bityutskiy@linux.intel.com> |
btrfs: nuke write_super from comments The '->write_super' superblock method is gone, and this patch removes all the references to 'write_super' from btrfs. Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a98cdb85 |
|
17-Jul-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: suppress printk() if all device I/O stats are zero Code is added to suppress the I/O stats printing at mount time if all statistic values are zero. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
5021976d |
|
17-Jul-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: remove unwanted printk() for btrfs device I/O stats People complained about the annoying kernel log message "btrfs: no dev_stats entry found ... (OK on first mount after mkfs)" everytime a filesystem is mounted for the first time after running mkfs. Since the distribution of the btrfs-progs is not synchronized to the kernel version, mkfs like it is now will be used also in the future. Then this message is not useful to find errors, it is just annoying. This commit removes the printk(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
02db0844 |
|
21-Jun-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: add DEVICE_READY ioctl This will be used in conjunction with btrfs device ready <dev>. This is needed for initrd's to have a nice and lightweight way to tell if all of the devices needed for a file system are in the cache currently. This keeps them from having to do mount+sleep loops waiting for devices to show up. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
b27f7c0c |
|
22-Jun-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: join DEV_STATS ioctls to one Commit c11d2c236cc260b36 (Btrfs: add ioctl to get and reset the device stats) introduced two ioctls doing almost the same thing distinguished by just the ioctl number which encodes "do reset after read". I have suggested http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html to implement it via the ioctl args. This hasn't happen, and I think we should use a more clean way to pass flags and should not waste ioctl numbers. CC: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
2b6ba629 |
|
22-Jun-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: resume balance on rw (re)mounts properly This introduces btrfs_resume_balance_async(), which, given that restriper state was recovered earlier by btrfs_recover_balance(), resumes balance in btrfs-balance kthread. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
68310a5e |
|
22-Jun-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: restore restriper state on all mounts Fix a bug that triggered asserts in btrfs_balance() in both normal and resume modes -- restriper state was not properly restored on read-only mounts. This factors out resuming code from btrfs_restore_balance(), which is now also called earlier in the mount sequence to avoid the problem of some early writes getting the old profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
597a60fa |
|
14-Jun-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: don't count I/O statistic read errors for missing devices It is normal behaviour of the low level btrfs function btrfs_map_bio() to complete a bio with -EIO if the device is missing, instead of just preventing the bio creation in an earlier step. This used to cause I/O statistic read error increments and annoying printk_ratelimited messages. This commit fixes the issue. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reported-by: Carey Underwood <cwillu@cwillu.com>
|
#
606686ee |
|
04-Jun-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: use rcu to protect device->name Al pointed out that we can just toss out the old name on a device and add a new one arbitrarily, so anybody who uses device->name in printk could possibly use free'd memory. Instead of adding locking around all of this he suggested doing it with RCU, so I've introduced a struct rcu_string that does just that and have gone through and protected all accesses to device->name that aren't under the uuid_mutex with rcu_read_lock(). This protects us and I will use it for dealing with removing the device that we used to mount the file system in a later patch. Thanks, Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
733f4fbb |
|
25-May-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: read device stats on mount, write modified ones during commit The device statistics are written into the device tree with each transaction commit. Only modified statistics are written. When a filesystem is mounted, the device statistics for each involved device are read from the device tree and used to initialize the counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
c11d2c23 |
|
25-May-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add ioctl to get and reset the device stats An ioctl interface is added to get the device statistic counters. A second ioctl is added to atomically get and reset these counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
442a4f63 |
|
25-May-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add device counters for detected IO and checksum errors The goal is to detect when drives start to get an increased error rate, when drives should be replaced soon. Therefore statistic counters are added that count IO errors (read, write and flush). Additionally, the software detected errors like checksum errors and corrupted blocks are counted. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
f8c5d0b4 |
|
10-May-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix wrong error returned by adding a device Reproduce: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs dev add /dev/sdb8 /mnt/btrfs ERROR: error adding the device '/dev/sdb8' - Invalid argument Since we mount with readonly options, and /dev/sdb7 is not a seeding one, a readonly notification is preferred. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
|
#
3e74317a |
|
26-Apr-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix repair code for RAID10 btrfs_map_block sets mirror_num, so that the repair code knows eventually which device gave us the read error. For RAID10, mirror_num must be 1 or 2. Before this fix mirror_num was incorrectly related to our stripe index. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
48d28232 |
|
14-Apr-2012 |
Julia Lawall <Julia.Lawall@lip6.fr> |
fs/btrfs/volumes.c: add missing free_fs_devices Free fs_devices as done in the error-handling code just below. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
|
#
37db63a4 |
|
13-Apr-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix max chunk size check in chunk allocator Fix a bug, where in case we need to adjust stripe_size so that the length of the resulting chunk is less than or equal to max_chunk_size, DUP chunks turn out to be only half as big as they could be. Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
b89203f7 |
|
12-Apr-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix eof while discarding extents We miscalculate the length of extents we're discarding, and it leads to an eof of device. Reported-by: Daniel Blueman <daniel@quora.org> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3c4bb26b |
|
27-Mar-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: flush out and clean up any block device pages during mount Btrfs puts the filesystem metadata into its own address space, and somehow the block device address space isn't getting onto disk properly before a mount. The end result is that a loop of mkfs and mounting the filesystem will sometimes find stale or incorrect data. This commit should fix it by sprinkling fdatawrites and invalidate_bdev calls around. This is a short term measure to make sure it is fixed. The block devices really should be flushed and cleaned up higher in the stack. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
213e64da |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix infinite loop in btrfs_shrink_device() If relocate of block group 0 fails with ENOSPC we end up infinitely looping because key.offset -= 1 statement in that case brings us back to where we started. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
e4837f8f |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow dup for data chunks in mixed mode Generally we don't allow dup for data, but mixed chunks are special and people seem to think this has its use cases. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
6728b198 |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: validate target profiles only if we are going to use them Do not run sanity checks on all target profiles unless they all will be used. This came up because alloc_profile_is_valid() is now more strict than it used to be. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
0c460c0d |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: move alloc_profile_is_valid() to volumes.c Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
e8920a64 |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: make profile_is_valid() check more strict "0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
899c81ea |
|
27-Mar-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add wrappers for working with alloc profiles Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
727011e0 |
|
06-Aug-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allow metadata blocks larger than the page size A few years ago the btrfs code to support blocks lager than the page size was disabled to fix a few corner cases in the page cache handling. This fixes the code to properly support large metadata blocks again. Since current kernels will crash early and often with larger metadata blocks, this adds an incompat bit so that older kernels can't mount it. This also does away with different blocksizes for nodes and leaves. You get a single block size for all tree blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
79787eaa |
|
12-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: replace many BUG_ONs with proper error handling btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
3acd3953 |
|
08-Sep-2011 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: Remove BUG_ON from __finish_chunk_alloc() btrfs_alloc_chunk() unconditionally BUGs on any error returned from __finish_chunk_alloc() so there's no need for two BUG_ON lines. Remove the one from __finish_chunk_alloc(). Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
1dd4602f |
|
08-Sep-2011 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: Remove BUG_ON from __btrfs_alloc_chunk() We BUG_ON() error from add_extent_mapping(), but that error looks pretty easy to bubble back up - as far as I can tell there have not been any permanent modifications to fs state at that point. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
2cdcecbc |
|
08-Sep-2011 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: Don't BUG_ON insert errors in btrfs_alloc_dev_extent() The only caller of btrfs_alloc_dev_extent() is __btrfs_alloc_chunk() which already bugs on any error returned. We can remove the BUG_ON's in btrfs_alloc_dev_extent() then since __btrfs_alloc_chunk() will "catch" them anyway. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
4ed1d16e |
|
10-Aug-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON errors in __finish_chunk_alloc() All callers of __finish_chunk_alloc() BUG_ON() return value, so it's trivial for us to always bubble up any errors caught in __finish_chunk_alloc() to be caught there. Signed-off-by: Mark Fasheh <mfasheh@suse.de>
|
#
143bede5 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
a6b0d5c8 |
|
20-Feb-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make sure we update latest_bdev When we are setting up the mount, we close all the devices that were not actually part of the metadata we found. But, we don't make sure that one of those devices wasn't fs_devices->latest_bdev, which means we can do a use after free on the one we closed. This updates latest_bdev as it goes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
285190d9 |
|
16-Feb-2012 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: check return value of lookup_extent_mapping() correctly This patch corrects error checking of lookup_extent_mapping(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
|
#
8a334426 |
|
07-Oct-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: silence warning in raid array setup Raid array setup code creates an extent buffer in an usual way. When the PAGE_CACHE_SIZE is > super block size, the extent pages are not marked up-to-date, which triggers a WARN_ON in the following write_extent_buffer call. Add an explicit up-to-date call to silence the warning. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
96bdc7dc |
|
16-Jan-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use larger system chunks system chunks by default are very small. This makes them slightly larger and also fixes the conditional checks to make sure we don't allocate a billion of them at once. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
19a39dce |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add balance progress reporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
a7e99c69 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for canceling restriper Implement an ioctl for canceling restriper. Currently we wait until relocation of the current block group is finished, in future this can be done by triggering a commit. Balance item is deleted and no memory about the interrupted balance is kept. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
837d5b6e |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for pausing restriper Implement an ioctl for pausing restriper. This pauses the relocation, but balance is still considered to be "in progress": balance item is not deleted, other volume operations cannot be started, etc. If paused in the middle of profile changing operation we will continue making allocations with the target profile. Add a hook to close_ctree() to pause restriper and free its data structures on unmount. (It's safe to unmount when restriper is in "paused" state, we will resume with the same parameters on the next mount) Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
9555c6c1 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add skip_balance mount option Since restriper kthread starts involuntarily on mount and can suck cpu and memory bandwidth add a mount option to forcefully skip it. The restriper in that case hangs around in paused state and can be resumed from userspace when it's convenient. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
59641015 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: recover balance on mount On mount, if balance item is found, resume balance in a separate kernel thread. Try to be smart to continue roughly where previous balance (or convert) was interrupted. For chunk types that were being converted to some profile we turn on soft convert, in case of a simple balance we turn on usage filter and relocate only less-than-90%-full chunks of that type. These are just heuristics but they help quite a bit, and can be improved in future. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
0940ebf6 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: save balance parameters to disk Introduce a new btree objectid for storing balance item. The reason is to be able to resume restriper after a crash with the same parameters. Balance item has a very high objectid and goes into tree of tree roots. The key for the new item is as follows: [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ] Older kernels simply ignore it so it's safe to mount with an older kernel and then go back to the newer one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
cfa4c961 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: soft profile changing mode (aka soft convert) When doing convert from one profile to another if soft mode is on restriper won't touch chunks that already have the profile we are converting to. This is useful if e.g. half of the FS was converted earlier. The soft mode switch is (like every other filter) per-type. This means that we can convert for example meta chunks the "hard" way while converting data chunks selectively with soft switch. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
e4d8ec0f |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: implement online profile changing Profile changing is done by launching a balance with BTRFS_BALANCE_CONVERT bits set and target fields of respective btrfs_balance_args structs initialized. Profile reducing code in this case will pick restriper's target profile if it's available instead of doing a blind reduce. If target profile is not yet available it goes back to a plain reduce. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
ea67176a |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: virtual address space subset filter Select chunks which have at least one byte located inside a given [vstart, vend) virtual address space range. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
94e60d5a |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: devid subset filter Select chunks which have at least one byte of at least one stripe located on a device with devid X in a given [pstart,pend) physical address range. This filter only works when devid filter is turned on. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
409d404b |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: devid filter Relocate chunks which have at least one stripe located on a device with devid X. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
5ce5b3c0 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: usage filter Select chunks that are less than X percent full. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
ed25e9b2 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: profiles filter Select chunks based on a given profile mask. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
f43ffb60 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add basic infrastructure for selective balancing This allows to have a separate set of filters for each chunk type (data,meta,sys). The code however is generic and switch on chunk type is only done once. This commit also adds a type filter: it allows to balance for example meta and system chunks w/o touching data ones. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
c9e9f97b |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add basic restriper infrastructure Add basic restriper infrastructure: extended balancing ioctl and all related ioctl data structures, add data structure for tracking restriper's state to fs_info, etc. The semantics of the old balancing ioctl are fully preserved. Explicitly disallow any volume operations when balance is in progress. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
52ba6929 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: introduce masks for chunk type and profile Chunk's type and profile are encoded in u64 flags field. Introduce masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_* constants, it should be ULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
6fef8df1 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: get rid of *_alloc_profile fields {data,metadata,system}_alloc_profile fields have been unused for a long time now. Get rid of them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
b367e47f |
|
06-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix possible deadlock when opening a seed device The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex, but when we mount a filesystem which has backing seed devices, we have this lock chain: open_ctree() lock(chunk_mutex); read_chunk_tree(); read_one_dev(); open_seed_devices(); lock(uuid_mutex); and then we hit a lockdep splat. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
ec9ef7a1 |
|
30-Nov-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: simplfy calculation of stripe length for discard operation For btrfs raid, while discarding a range of space, we'll need to know the start offset and length to discard for each device, and it's done in btrfs_map_block(). However the calculation is a bit complex for raid0 and raid10, so I reimplement it based on a fact that: dev1 dev2 dev3 (raid0) ----------------------------------- s0 s3 s6 s1 s4 s7 s2 s5 Each device has (total_stripes / nr_dev) stripes, or plus one. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
de11cc12 |
|
30-Nov-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: don't pre-allocate btrfs bio We pre-allocate a btrfs bio with fixed size, and then may re-allocate memory if we find stripes are bigger than the fixed size. But this pre-allocation is not necessary. Also we don't have to calcuate the stripe number twice. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
125ccb0a |
|
08-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: don't pass a trans handle unnecessarily in volumes.c Some functions never use the transaction handle passed to them. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
10f6327b |
|
17-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: fix a deadlock in btrfs_scan_one_device() pathname resolution under a global mutex, taken on some paths in ->mount() is a Bad Idea(tm) - think what happens if said pathname resolution triggers automount of some btrfs instance and walks into attempt to grab the same mutex. Deadlock - we are waiting for daemon to finish walking the path, daemon is waiting for us to release the mutex... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1100373f |
|
06-Jan-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use bigger metadata chunks on bigger filesystems The 256MB chunk is a little small on a huge FS. This scales up the chunk size. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
21adbd5c |
|
09-Nov-2011 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: integrate integrity check module into btrfs This is the last part of the patch series. It modifies the btrfs code to use the integrity check module if configured to do so with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set, the only effective change is that code is added that handles the mount option to activate the integrity check. If the mount option is set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code complains in the log and the mount fails with EINVAL. Add the mount option to activate the usage of the integrity check code. Add invocation of btrfs integrity check code init and cleanup function on mount and umount, respectively. Add hook to call btrfs integrity check code version of submit_bh/submit_bio. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
d85c8a6f |
|
15-Dec-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: unplug every once and a while The btrfs io submission threads can build up massive plug lists. This keeps things more reasonable so we don't hand over huge dumps of IO at once. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5dbc8fca |
|
09-Dec-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror btrfs_end_bio checks the number of errors on a bio against the max number of errors allowed before sending any EIOs up to the higher levels. If we got enough copies of the bio done for a given raid level, it is supposed to clear the bio error flag and return success. We have pointers to the original bio sent down by the higher layers and pointers to any cloned bios we made for raid purposes. If the original bio happens to be the one that got an io error, but not the last one to finish, it might not have the BIO_UPTODATE bit set. Then, when the last bio does finish, we'll call bio_end_io on the original bio. It won't have the uptodate bit set and we'll end up sending EIO to the higher layers. We already had a check for this, it just was conditional on getting the IO error on the very last bio. Make the check unconditional so we eat the EIOs properly. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a5d16333 |
|
07-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: check if the to-be-added device is writable If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding a readonly device to a btrfs filesystem, and btrfs will write to that device, emitting kernel errors: [ 3109.833692] lost page write due to I/O error on loop2 [ 3109.833720] lost page write due to I/O error on loop2 ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
924cd8fb |
|
10-Nov-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix nocow when deleting the item btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves, if we modify the result of it, the meta-data will be broken. fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6c41761f |
|
13-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: separate superblock items out of fs_info fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
20bcd649 |
|
19-Oct-2011 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: close all bdevs on mount failure Fix a bug introduced by 20b45077. We have to return EINVAL on mount failure, but doing that too early in the sequence leaves all of the devices opened exclusively. This also fixes an issue where under some scenarios only a second mount -o degraded <devices> command would succeed. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
2bf64758 |
|
26-Sep-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: allow us to overcommit our enospc reservations One of the things that kills us is the fact that our ENOSPC reservations are horribly over the top in most normal cases. There isn't too much that can be done about this because when we are completely full we really need them to work like this so we don't under reserve. However if there is plenty of unallocated chunks on the disk we can use that to gauge how much we can overcommit. So this patch adds chunk free space accounting so we always know how much unallocated space we have. Then if we fail to make a reservation within our allocated space, check to see if we can overcommit. In the normal flushing case (like with delalloc metadata reservations) we'll take the free space and divide it by 2 if our metadata profile is setup for DUP or any of those, and then divide it by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing case (we really need this reservation now!) we only limit ourselves to half of the free space. This makes this fio test [torrent] filename=torrent-test rw=randwrite size=4g ioengine=sync directory=/mnt/btrfs-test go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB file system. This doesn't seem to break my other enospc tests, but could really use some more testing as this is a super scary change. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
90519d66 |
|
23-May-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: state information for readahead Add state information for readahead to btrfs_fs_info and btrfs_device Changes v2: - don't wait in radix_trees - add own set of workers for readahead Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
2774b2ca |
|
15-Jun-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
btrfs: Put mirror_num in bi_bdev The error correction code wants to make sure that only the bad mirror is rewritten. Thus, we need to know which mirror is the bad one. I did not find a more apropriate field than bi_bdev. But I think using this is fine, because it is modified by the block layer, anyway, and should not be read after the bio returned. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
a1d3c478 |
|
04-Aug-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
btrfs: btrfs_multi_bio replaced with btrfs_bio btrfs_bio is a bio abstraction able to split and not complete after the last bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio tracks the mirror_num used to read data which can be used for error correction purposes. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
0e588859 |
|
05-Aug-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix uninitialized sync_pending sync_pending is uninitialized before it be used, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
38c01b96 |
|
01-Aug-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: fix a bug of balance on full multi-disk partitions When balancing, we'll first try to shrink devices for some space, but if it is working on a full multi-disk partition with raid protection, we may encounter a bug, that is, while shrinking, total_bytes may be less than bytes_used, and btrfs may allocate a dev extent that accesses out of device's bounds. Then we will not be able to write or read the data which stores at the end of the device, and get the followings: device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5 Btrfs detected SSD devices, enabling SSD mode btrfs: relocating block group 476315648 flags 9 btrfs: found 4 extents attempt to access beyond end of device sdb5: rw=145, want=546176, limit=546147 attempt to access beyond end of device sdb5: rw=145, want=546304, limit=546147 attempt to access beyond end of device sdb5: rw=145, want=546432, limit=546147 attempt to access beyond end of device sdb5: rw=145, want=546560, limit=546147 attempt to access beyond end of device Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d5e2003c |
|
04-Aug-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: detect wether a device supports discard We have a problem where if a user specifies discard but doesn't actually support it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem because this gets called (in a fashion) from the tree log recovery code, which has a nice little BUG_ON(ret) after it, which causes us to fail the tree log replay. So instead detect wether our devices support discard when we're adding them and then don't issue discards if we know that the device doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard just in case we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2ab1ba68 |
|
04-Aug-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: force unplugs when switching from high to regular priority bios Btrfs does bio submissions from a worker thread, and each device has a list of high priority bios and regular priority bios. Synchronous writes go to the high priority thread while async writes go to regular list. This commit brings back an explicit unplug any time we switch from high to regular priority, which makes it easier for the block layer to give us low latencies. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
85d4e461 |
|
26-Jul-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make a lockdep class for each root This patch was originally from Tejun Heo. lockdep complains about the btrfs locking because we sometimes take btree locks from two different trees at the same time. The current classes are based only on level in the btree, which isn't enough information for lockdep to figure out if the lock is safe. This patch makes a class for each type of tree, and lumps all the FS trees that actually have files and directories into the same class. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
92b8e897 |
|
12-Jul-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON alloc_path errors in find_next_chunk I also removed the BUG_ON from error return of find_next_chunk in init_first_rw_device(). It turns out that the only caller of init_first_rw_device() also BUGS on any nonzero return so no actual behavior change has occurred here. do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk() which can now return -ENOMEM. Instead of setting space_info->full on any error from btrfs_alloc_chunk() I catch and return every error value _except_ -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
|
#
17e9f796 |
|
12-Jul-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON alloc_path errors in btrfs_balance() Dealing with this seems trivial - the only caller of btrfs_balance() is btrfs_ioctl() which passes the error code directly back to userspace. There also isn't much state to unwind (if I'm wrong about this point, we can always safely move the allocation to the top of btrfs_balance() anyway). Signed-off-by: Mark Fasheh <mfasheh@suse.com>
|
#
508794eb |
|
02-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: don't panic if we get an error while balancing V2 A user reported an error where if we try to balance an fs after a device has been removed it will blow up. This is because we get an EIO back and this is where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks, Reported-by: Anand Jain <Anand.Jain@oracle.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
22b63a29 |
|
09-Feb-2011 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs - use %pU to print fsid Get rid of FIXME comment. Uuids from dmesg are now the same as uuids given by btrfs-progs. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f3f302a |
|
30-May-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: false BUG_ON when degraded In degraded mode the struct btrfs_device of missing devs don't have device->name set. A kstrdup of NULL correctly returns NULL. Don't BUG in this case. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1f78160c |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: using rcu lock in the reader side of devices list fs_devices->devices is only updated on remove and add device paths, so we can use rcu to protect it in the reader side Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
46224705 |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: drop unnecessary device lock Drop device_list_mutex for the reader side on clone_fs_devices and btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device list is not updated btrfs_close_extra_devices is the initialized path, we can not add or remove device at this time, so we can simply drop the mutex safely, like other initialized function does(add_missing_dev, __find_device, __btrfs_open_devices ...). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0c1daee0 |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: fix the race between remove dev and alloc chunk On remove device path, it updates device->dev_alloc_list but does not hold chunk lock Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c9513edb |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: fix the race between reading and updating devices On btrfs_congested_fn and __unplug_io_fn paths, we should hold device_list_mutex to avoid remove/add device path to update fs_devices->devices On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in fs_devices->devices or fs_devices->devices is updated, so we should hold the mutex to avoid the reader side to reach them Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4f6c9328 |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: fix bh leak on __btrfs_open_devices path 'bh' is forgot to release if no error is detected Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
65a246c5 |
|
18-May-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: return error code to caller when btrfs_del_item fails The error code is returned instead of calling BUG_ON when btrfs_del_item returns the error. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b0b802d7 |
|
19-May-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: return error code to caller when btrfs_previous_item fails The error code is returned instead of calling BUG_ON when btrfs_previous_item returns the error. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
73c5de00 |
|
11-Apr-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: quasi-round-robin for chunk allocation In a multi device setup, the chunk allocator currently always allocates chunks on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. This patch always sorts the devices before allocating, and allocates the stripes on the devices with the most available space, as long as there is enough space available. In a low space situation, it first tries to maximize striping. The patch also simplifies the allocator and reduces the checks for corner cases. The simplification is done by several means. First, it defines the properties of each RAID type upfront. These properties are used afterwards instead of differentiating cases in several places. Second, the old allocator defined a minimum stripe size for each block group type, tried to find a large enough chunk, and if this fails just allocates a smaller one. This is now done in one step. The largest possible chunk (up to max_chunk_size) is searched and allocated. Because we now have only one pass, the allocation of the map (struct map_lookup) is moved down to the point where the number of stripes is already known. This way we avoid reallocation of the map. We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
|
#
a9c9bf68 |
|
12-Apr-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: heed alloc_start currently alloc_start is disregarded if the requested chunk size is bigger than (device size - alloc_start), but smaller than the device size. The only situation where I see this could have made sense was when a chunk equal the size of the device has been requested. This was possible as the allocator failed to take alloc_start into account when calculating the request chunk size. As this gets fixed by this patch, the workaround is not necessary anymore.
|
#
bcd53741 |
|
12-Apr-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: move btrfs_cmp_device_free_bytes to super.c this function won't be used here anymore, so move it super.c where it is used for df-calculation
|
#
a2de733c |
|
08-Mar-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: scrub This adds an initial implementation for scrub. It works quite straightforward. The usermode issues an ioctl for each device in the fs. For each device, it enumerates the allocated device chunks. For each chunk, the contained extents are enumerated and the data checksums fetched. The extents are read sequentially and the checksums verified. If an error occurs (checksum or EIO), a good copy is searched for. If one is found, the bad copy will be rewritten. All enumerations happen from the commit roots. During a transaction commit, the scrubs get paused and afterwards continue from the new roots. This commit is based on the series originally posted to linux-btrfs with some improvements that resulted from comments from David Sterba, Ilya Dryomov and Jan Schmidt. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
f2a97a9d |
|
04-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: remove all unused functions Remove static and global declarations and/or definitions. Reduces size of btrfs.ko by ~3.4kB. text data bss dec hex filename 402081 7464 200 409745 64091 btrfs.ko.base 398620 7144 200 405964 631cc btrfs.ko.remove-all Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b3b4aa74 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop unused parameter from btrfs_release_path parameter tree root it's not used since commit 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer interface for large blocksizes") Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
172ddd60 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop gfp parameter from alloc_extent_map pass GFP_NOFS directly to kmem_cache_alloc Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
a8067e02 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop unused parameter from extent_map_tree_init the GFP flags are not stored anywhere and all allocations are done via alloc_extent_map(GFP_NOFS). Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
211588ad |
|
19-Apr-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: do some plugging in the submit_bio threads The Btrfs submit bio threads have a small number of threads responsible for pushing down bios we've collected for a large number of devices. Since we do all the bios for a single device at once, we want to make sure we unplug and send down the bios for each device as we're done processing them. The new plugging API removed the btrfs code to unplug while processing bios, this adds it back with the new API. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d9d04879 |
|
27-Mar-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix __btrfs_map_block on 32 bit machines Recent changes for discard support didn't compile, this fixes them not to try and % 64 bit numbers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fce3bb9a |
|
24-Mar-2011 |
Li Dongyang <lidongyang@novell.com> |
Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP btrfs_map_block() will only return a single stripe length, but we want the full extent be mapped to each disk when we are trimming the extent, so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1abe9b8a |
|
24-Mar-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: add initial tracepoint support for btrfs Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7eaceacc |
|
10-Mar-2011 |
Jens Axboe <jaxboe@fusionio.com> |
block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
fb01aa85 |
|
15-Feb-2011 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: set FMODE_EXCL in btrfs_device->mode This fixes a bug introduced in d4d77629, where the device added online (and therefore initialized via btrfs_init_new_device()) would be left with the positive bdev->bd_holders after unmount. Since d4d77629 we no longer OR FMODE_EXCL explicitly on blkdev_put(), set it in btrfs_device->mode. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9b3517e9 |
|
15-Feb-2011 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: make btrfs_rm_device() fail gracefully If shrinking done as part of the online device removal fails add that device back to the allocation list and increment the rw_devices counter. This fixes two bugs: 1) we could have a perfectly good device out of alloc list for no good reason; 2) in the btrfs consisting of two devices, failure in btrfs_rm_device() could lead to a situation where it was impossible to remove any of the devices because of the "unable to remove the only writeable device" error. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
67100f25 |
|
06-Feb-2011 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs - Fix memory leak in btrfs_init_new_device() Memory allocated by calling kstrdup() should be freed. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
98d5dc13 |
|
19-Jan-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
btrfs: fix return value check of btrfs_start_transaction() The error check of btrfs_start_transaction() is added, and the mistake of the error check on several places is corrected. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6f88a440 |
|
29-Dec-2010 |
Ben Hutchings <ben@decadent.org.uk> |
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire filesystem and may run uninterruptibly for a long time. This does not seem to be something that an unprivileged user should be able to do. Reported-by: Aron Xu <happyaron.xu@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
20b45077 |
|
08-Jan-2011 |
Dave Young <hidave.darkstar@gmail.com> |
btrfs: mount failure return value fix I happened to pass swap partition as root partition in cmdline, then kernel panic and tell me about "Cannot open root device". It is not correct, in fact it is a fs type mismatch instead of 'no device'. Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL. The logic in init/do_mounts.c: for (p = fs_names; *p; p += strlen(p)+1) { int err = do_mount_root(name, p, flags, root_mount_data); switch (err) { case 0: goto out; case -EACCES: flags |= MS_RDONLY; goto retry; case -EINVAL: continue; } print "Cannot open root device" panic } SO fs type after btrfs will have no chance to mount Here fix the return value as -EINVAL Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6d07bcec |
|
05-Jan-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: fix wrong free space information of btrfs When we store data by raid profile in btrfs with two or more different size disks, df command shows there is some free space in the filesystem, but the user can not write any data in fact, df command shows the wrong free space information of btrfs. # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10 # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 28.00KB devid 1 size 5.01GB used 2.03GB path /dev/sda9 devid 2 size 10.00GB used 2.01GB path /dev/sda10 # btrfs device scan /dev/sda9 /dev/sda10 # mount /dev/sda9 /mnt # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999 (fill the filesystem) # sync # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 3.99GB devid 1 size 5.01GB used 5.01GB path /dev/sda9 devid 2 size 10.00GB used 4.99GB path /dev/sda10 It is because btrfs cannot allocate chunks when one of the pairing disks has no space, the free space on the other disks can not be used for ever, and should be subtracted from the total space, but btrfs doesn't subtract this space from the total. It is strange to the user. This patch fixes it by calcing the free space that can be used to allocate chunks. Implementation: 1. get all the devices free space, and align them by stripe length. 2. sort the devices by the free space. 3. check the free space of the devices, 3.1. if it is not zero, and then check the number of the devices that has more free space than this device, if the number of the devices is beyond the min stripe number, the free space can be used, and add into total free space. if the number of the devices is below the min stripe number, we can not use the free space, the check ends. 3.2. if the free space is zero, check the next devices, goto 3.1 This implementation is just likely fake chunk allocation. After appling this patch, df can show correct space information: # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 0 100% /mnt Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b2117a39 |
|
05-Jan-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: make the chunk allocator utilize the devices better With this patch, we change the handling method when we can not get enough free extents with default size. Implementation: 1. Look up the suitable free extent on each device and keep the search result. If not find a suitable free extent, keep the max free extent 2. If we get enough suitable free extents with default size, chunk allocation succeeds. 3. If we can not get enough free extents, but the number of the extent with default size is >= min_stripes, we just change the mapping information (reduce the number of stripes in the extent map), and chunk allocation succeeds. 4. If the number of the extent with default size is < min_stripes, sort the devices by its max free extent's size descending 5. Use the size of the max free extent on the (num_stripes - 1)th device as the stripe size to allocate the device space By this way, the chunk allocator can allocate chunks as large as possible when the devices' space is not enough and make full use of the devices. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7bfc837d |
|
05-Jan-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: restructure find_free_dev_extent() - make it return the start position and length of the max free space when it can not find a suitable free space. - make it more readability Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1974a3b4 |
|
05-Jan-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: fix wrong calculation of stripe size There are two tiny problem: - One is When we check the chunk size is greater than the max chunk size or not, we should take mirrors into account, but the original code didn't. - The other is btrfs shouldn't use the size of the residual free space as the length of of a dup chunk when doing chunk allocation. It is because the device space that a dup chunk needs is twice as large as the chunk size, if we use the size of the residual free space as the length of a dup chunk, we can not get enough free space. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cd02dca5 |
|
13-Dec-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: account for missing devices in RAID allocation profiles When we mount in RAID degraded mode without adding a new device to replace the failed one, we can end up using the wrong RAID flags for allocations. This results in strange combinations of block groups (raid1 in a raid10 filesystem) and corruptions when we try to allocate blocks from single spindle chunks on drives that are actually missing. The first device has two small 4MB chunks in it that mkfs creates and these are usually unused in a raid1 or raid10 setup. But, in -o degraded, the allocator will fall back to these because the mask of desired raid groups isn't correct. The fix here is to count the missing devices as we build up the list of devices in the system. This count is used when picking the raid level to make sure we continue using the same levels that were in place before we lost a drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d4d77629 |
|
13-Nov-2010 |
Tejun Heo <tj@kernel.org> |
block: clean up blkdev_get() wrappers and their users After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>
|
#
e525fd89 |
|
13-Nov-2010 |
Tejun Heo <tj@kernel.org> |
block: make blkdev_get/put() handle exclusive access Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>
|
#
37004c42 |
|
13-Nov-2010 |
Tejun Heo <tj@kernel.org> |
btrfs: close_bdev_exclusive() should use the same @flags as the matching open_bdev_exclusive() In the failure path of __btrfs_open_devices(), close_bdev_exclusive() is called with @flags which doesn't match the one used during open_bdev_exclusive(). Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <chris.mason@oracle.com>
|
#
559af821 |
|
29-Oct-2010 |
Andi Kleen <andi@firstfloor.org> |
Btrfs: cleanup warnings from gcc 4.6 (nonbugs) These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
18e503d6 |
|
28-Oct-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix raid code for removing missing drives When btrfs is mounted in degraded mode, it has some internal structures to track the missing devices. This missing device is setup as readonly, but the mapping code can get upset when we try to write to it. This changes the mapping code to return -EIO instead of oops when we try to write to the readonly device. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c3b9a62c |
|
18-Aug-2010 |
Christoph Hellwig <hch@infradead.org> |
btrfs: replace barriers with explicit flush / FUA usage Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP detection for barriers and stop setting the barrier flag for discards. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
7b6d91da |
|
07-Aug-2010 |
Christoph Hellwig <hch@lst.de> |
block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
a22285a6 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Integrate metadata reservation with start_transaction Besides simplify the code, this change makes sure all metadata reservation for normal metadata operations are released after committing transaction. Changes since V1: Add code that check if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9f680ce0 |
|
06-Apr-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make sure the chunk allocator doesn't create zero length chunks A recent commit allowed for smaller chunks to be created, but didn't make sure they were always bigger than a stripe. After some divides, this led to zero length stripes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0cad8a11 |
|
17-Mar-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix chunk allocate size calculation If the amount of free space left in a device is less than what we think should be the minimum size, just ignore the minimum size and use the amount we have. I ran into this running tests on a 600mb volume, the chunk allocator wouldn't let me allocate the last 52mb of the disk for data because we want to have at least 64mb chunks for data. This patch fixes that problem. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f3eae7e8 |
|
24-Mar-2010 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk() We can use this simple method to make source more readable. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ab59381e |
|
24-Mar-2010 |
Zhao Lei <zhaolei@cn.fujitsu.com> |
Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree() We need to check return value of btrfs_search_slot() in btrfs_read_chunk_tree() and do corresponding error handing. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
5ff7ba3a |
|
15-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't look at bio flags after submit_bio After callling submit_bio, the bio can be freed at any time. The btrfs submission thread helper was checking the bio flags too late, which might not give the correct answer. When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a343832f |
|
06-Jan-2010 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
btrfs: using btrfs_stack_device_id() get devid We can use btrfs_stack_device_id() to get dev_item->devid Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3a0524dc |
|
08-Feb-2010 |
TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> |
btrfs: Update existing btrfs_device for renaming device When we scan devices in a multi-device filesystem, we memorize the original name. If the device gets a new name, later scans don't update the in-kernel structures related to it, and we're not able to mount the filesystem. This patch updates device name during scaning. Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
51684082 |
|
10-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: run the backing dev more often in the submit_bio helper The submit_bio helper thread can decide to loop back around to service more bios. This commit forces it to unplug first, which helps reduce the latency seen by submitters. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
035fe03a |
|
26-Jan-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: check total number of devices when removing missing If you have a disk failure in RAID1 and then add a new disk to the array, and then try to remove the missing volume, it will fail. The reason is the sanity check only looks at the total number of rw devices, which is just 2 because we have 2 good disks and 1 bad one. Instead check the total number of devices in the array to make sure we can actually remove the device. Tested this with a failed disk setup and with this test we can now run btrfs-vol -r missing /mount/point and it works fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7f59203a |
|
26-Jan-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: check return value of open_bdev_exclusive properly Hit this problem while testing RAID1 failure stuff. open_bdev_exclusive returns ERR_PTR(), not NULL. So change the return value properly. This is important if you accidently specify a device that doesn't exist when trying to add a new device to an array, you will panic the box dereferencing bdev. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f48b9075 |
|
26-Jan-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: do not mark the chunk as readonly if in degraded mode If a RAID setup has chunks that span multiple disks, and one of those disks has failed, btrfs_chunk_readonly will return 1 since one of the disks in that chunk's stripes is dead and therefore not writeable. So instead if we are in degraded mode, return 0 so we can go ahead and allocate stuff. Without this patch all of the block groups in a RAID1 setup will end up read-only, which will mean we can't add new disks to the array since we won't be able to make allocations. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2423fdfb |
|
06-Jan-2010 |
Jiri Slaby <jirislaby@kernel.org> |
Btrfs, fix memory leaks in error paths Stanse found 2 memory leaks in relocate_block_group and __btrfs_map_block. cluster and multi are not freed/assigned on all paths. Fix that. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
83d3c969 |
|
07-Dec-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: make metadata chunks smaller This patch makes us a bit less zealous about making sure we have enough free metadata space by pearing down the size of new metadata chunks to 256mb instead of 1gb. Also, we used to try an allocate metadata chunks when allocating data, but that sort of thing is done elsewhere now so we can just remove it. With my -ENOSPC test I used to have 3gb reserved for metadata out of 75gb, now I have 1.7gb. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fd2696f3 |
|
29-Sep-2009 |
Julia Lawall <julia@diku.dk> |
Btrfs: introduce missing kfree Error handling code following a kzalloc should free the allocated data. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,f1,l; position p1,p2; expression *ptr != NULL; @@ x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...); ... if (x == NULL) S <... when != x when != if (...) { <+...x...+> } ( x->f1 = E | (x->f1 == NULL || ...) | f(...,x->f1,...) ) ...> ( return \(0\|<+...x...+>\|ptr\); | return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ba1bf481 |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: make balance code choose more wisely when relocating Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
890871be |
|
02-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: switch extent_map to a rw lock There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
57fd5a5f |
|
07-Aug-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: tweak congestion backoff The btrfs io submission thread tries to back off congested devices in favor of rotating off to another disk. But, it tries to make sure it submits at least some IO before rotating on (the others may be congested too), and so it has a magic number of requests it tries to write before it hops. This makes the magic number smaller. Testing shows that we're spending too much time on congested devices and leaving the other devices idle. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1f98a13f |
|
11-Sep-2009 |
Jens Axboe <jens.axboe@oracle.com> |
bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
#
9779b72f |
|
24-Jul-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: find smallest available device extent during chunk allocation Allocating new block group is easy when the disk has plenty of space. But things get difficult as the disk fills up, especially if the FS has been run through btrfs-vol -b. The balance operation is likely to make the total bytes available on the device greater than the largest extent we'll actually be able to allocate. But the device extent allocation code incorrectly assumes that a device with 5G free will be able to allocate a 5G extent. It isn't normally a problem because device extents don't get freed unless btrfs-vol -b is run. This fixes the device extent allocator to remember the largest free extent it can find, and then uses that value as a fallback. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1fcbac58 |
|
24-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: find_free_dev_extent doesn't handle holes at the start of the device find_free_dev_extent does not properly handle the case where the device is not complete free, and there is a free extent at the beginning of the device. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3acada49 |
|
22-Jul-2009 |
David Woodhouse <dwmw2@infradead.org> |
Btrfs: Remove broken sanity check from btrfs_rmap_block() It was never actually doing anything anyway (see the loop condition), and it would be difficult to make it work for RAID[56]. Even if it was actually working, it's checking for the wrong thing anyway. Instead of checking whether we list a block which _doesn't_ land at the relevant physical location, it should be checking that we _have_ listed all the logical blocks which refer to the required physical location on all devices. This function is only called from remove_sb_from_cache() to ensure that we reserve the logical blocks which would reside at the same physical location as the superblock copies. So listing more blocks than we need is actually OK. With RAID[56] we're going to throw away an entire stripe for each block we have to ignore, so we _are_ going to list blocks other than the ones which actually contain the superblock. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bf1fb512 |
|
22-Jul-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: properly update space information after shrinking device. Change 'goto done' to 'break' for the case of all device extents have been freed, so that the code updates space information will be execute. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e5e9a520 |
|
10-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: avoid races between super writeout and device list updates On multi-device filesystems, btrfs writes supers to all of the devices before considering a sync complete. There wasn't any additional locking between super writeout and the device list management code because device management was done inside a transaction and super writeout only happened with no transation writers running. With the btrfs fsync log and other async transaction updates, this has been racey for some time. This adds a mutex to protect the device list. The existing volume mutex could not be reused due to transaction lock ordering requirements. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c289811c |
|
10-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: autodetect SSD devices During mount, btrfs will check the queue nonrot flag for all the devices found in the FS. If they are all non-rotating, SSD mode is enabled by default. If the FS was mounted with -o nossd, the non-rotating flag is ignored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d644d8a1 |
|
09-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: avoid IO stalls behind congested devices in a multi-device FS The btrfs IO submission threads try to service a bunch of devices with a small number of threads. They do a congestion check to try and avoid waiting on requests for a busy device. The checks make sure we've sent a few requests down to a given device just so that we aren't bouncing between busy devices without actually sending down any IO. The counter used to decide if we can switch to the next device is somewhat overloaded. It is also being used to decide if we've done a good batch of requests between the WRITE_SYNC or regular priority lists. It may get reset to zero often, leaving us hammering on a busy device instead of moving on to another disk. This commit adds a new counter for the number of bios sent while servicing a device. It doesn't get reset or fiddled with. On multi-device filesystems, this fixes IO stalls in streaming write workloads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d84275c9 |
|
09-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: don't allow WRITE_SYNC bios to starve out regular writes Btrfs uses dedicated threads to submit bios when checksumming is on, which allows us to make sure the threads dedicated to checksumming don't get stuck waiting for requests. For each btrfs device, there are two lists of bios. One list is for WRITE_SYNC bios and the other is for regular priority bios. The IO submission threads used to process all of the WRITE_SYNC bios first and then switch to the regular bios. This commit makes sure we don't completely starve the regular bios by rotating between the two lists. WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries to run in batches to avoid seeking. Benchmarking shows this eliminates stalls during streaming buffered writes on both multi-device and single device filesystems. If the regular bios starve, the system can end up with a large amount of ram pinned down in writeback pages. If we are a little more fair between the two classes, we're able to keep throughput up and make progress on the bulk of our dirty ram. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d4f98a2 |
|
10-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2cc3c559 |
|
04-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: set device->total_disk_bytes when adding new device It was not being properly initialized, and so the size saved to disk was not correct. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d6397bae |
|
27-Apr-2009 |
Chris Ball <cjb@laptop.org> |
Btrfs: When shrinking, only update disk size on success Previously, we updated a device's size prior to attempting a shrink operation. This patch moves the device resizing logic to only happen if the shrink completes successfully. In the process, it introduces a new field to btrfs_device -- disk_total_bytes -- to track the on-disk size. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ffbd517d |
|
20-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: use WRITE_SYNC for synchronous writes Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for writes we plan on waiting on in the near future. This patch mirrors recent changes in other filesystems and the generic code to use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for other latency critical writes. Btrfs uses async worker threads for checksumming before the write is done, and then again to actually submit the bios. The bio submission code just runs a per-device list of bios that need to be sent down the pipe. This list is split into low priority and high priority lists so the WRITE_SYNC IO happens first. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bedf762b |
|
03-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: unplug in the async bio submission threads Btrfs pages being written get set to writeback, and then may go through a number of steps before they hit the block layer. This includes compression, checksumming and async bio submission. The end result is that someone who writes a page and then does wait_on_page_writeback is likely to unplug the queue before the bio they cared about got there. We could fix this by marking bios sync, or by doing more frequent unplugs, but this commit just changes the async bio submission code to unplug after it has processed all the bios for a device. The async bio submission does a fair job of collection bios, so this shouldn't be a huge problem for reducing merging at the elevator. For streaming O_DIRECT writes on a 5 drive array, it boosts performance from 386MB/s to 460MB/s. Thanks to Hisashi Hifumi for helping with this work. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b765ead5 |
|
03-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: keep processing bios for a given bdev if our proc is batching Btrfs uses async helper threads to submit write bios so the checksumming helper threads don't block on the disk. The submit bio threads may process bios for more than one block device, so when they find one device congested they try to move on to other devices instead of blocking in get_request_wait for one device. This does a pretty good job of keeping multiple devices busy, but the congested flag has a number of problems. A congested device may still give you a request, and other procs that aren't backing off the congested device may starve you out. This commit uses the io_context stored in current to decide if our process has been made a batching process by the block layer. If so, it keeps sending IO down for at least one batch. This helps make sure we do a good amount of work each time we visit a bdev, and avoids large IO stalls in multi-device workloads. It's also very ugly. A better solution is in the works with Jens Axboe. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
913d952e |
|
10-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Clear space_info full when adding new devices The full flag on the space info structs tells the allocator not to try and allocate more chunks because the devices in the FS are fully allocated. When more devices are added, we need to clear the full flag so the allocator knows it has more space available. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4184ea7f |
|
09-Mar-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix locking around adding new space_info Storage allocated to different raid levels in btrfs is tracked by a btrfs_space_info structure, and all of the current space_infos are collected into a list_head. Most filesystems have 3 or 4 of these structs total, and the list is only changed when new raid levels are added or at unmount time. This commit adds rcu locking on the list head, and properly frees things at unmount time. It also clears the space_info->full flag whenever new space is added to the FS. The locking for the space info list goes like this: reads: protected by rcu_read_lock() writes: protected by the chunk_mutex At unmount time we don't need special locking because all the readers are gone. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4008c04a |
|
12-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make a lockdep class for the extent buffer locks Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3f3420df |
|
12-Feb-2009 |
Julia Lawall <julia@diku.dk> |
Btrfs: fs/btrfs/volumes.c: remove useless kzalloc The call to kzalloc is followed by a kmalloc whose result is stored in the same variable. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,l; position p1,p2; expression *ptr != NULL; @@ ( if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S | x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...); ... if (x == NULL) S ) <... when != x when != if (...) { <+...x...+> } x->f = E ...> ( return \(0\|<+...x...+>\|ptr\); | return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a6837051 |
|
04-Feb-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Catch missed bios in the async bio submission thread The async bio submission thread was missing some bios that were added after it had decided there was no work left to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c6e30871 |
|
21-Jan-2009 |
Qinghuang Feng <qhfeng.kernel@gmail.com> |
Btrfs: simplify iteration codes Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
119e10cf |
|
21-Jan-2009 |
Roland Dreier <rdreier@cisco.com> |
Btrfs: Remove extra KERN_INFO in the middle of a line The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device() actually follows another printk that doesn't end in a newline (since the intention is for the two printks to make one line of output), so the KERN_INFO just ends up messing up the output: device label exp <6>devid 1 transid 9 /dev/sda5 Fix this by changing the extra KERN_INFO to KERN_CONT. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7eaebe7d |
|
21-Jan-2009 |
Huang Weiyi <weiyi.huang@gmail.com> |
Btrfs: removed unused #include <version.h>'s Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1d9e2ae9 |
|
16-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Clear the device->running_pending flag before bailing on congestion Btrfs maintains a queue of async bio submissions so the checksumming threads don't have to wait on get_request_wait. In order to avoid extra wakeups, this code has a running_pending flag that is used to tell new submissions they don't need to wake the thread. When the threads notice congestion on a single device, they may decide to requeue the job and move on to other devices. This makes sure the running_pending flag is cleared before the job is requeued. It should help avoid IO stalls by making sure the task is woken up when new submissions come in. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d397712b |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e4404d6e |
|
12-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: shared seed device This patch makes seed device possible to be shared by multiple mounted file systems. The sharing is achieved by cloning seed device's btrfs_fs_devices structure. Thanks you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c3027eb5 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add inode sequence number for NFS and reserved space in a few structs This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
934d375b |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use map_private_extent_buffer during generic_bin_search It is possible that generic_bin_search will be called on a tree block that has not been locked. This happens because cache_block_block skips locking on the tree blocks. Since the tree block isn't locked, we aren't allowed to change the extent_buffer->map_token field. Using map_private_extent_buffer avoids any changes to the internal extent buffer fields. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a512bbf8 |
|
08-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: superblock duplication This patch implements superblock duplication. Superblocks are stored at offset 16K, 64M and 256G on every devices. Spaces used by superblocks are preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
d20f7043 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: move data checksumming into a dedicated tree Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
97288f2c |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: corret fmode_t annotations Make sure to propagate fmode_t properly and use the right constants for it. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
b2950863 |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: make things static and include the right headers Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
4b4e25f2 |
|
20-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: compat code fixes The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
15916de8 |
|
19-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fixes for 2.6.28-rc API changes * open/close_bdev_excl -> open/close_bdev_exclusive * blkdev_issue_discard takes a GFP mask now * Fix blkdev_issue_discard usage now that it is enabled Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7cbd8a83 |
|
12-Nov-2008 |
yanhai zhu <zhu.yanhai@gmail.com> |
Btrfs: Add a missing return pointer check Add a missing kzalloc() return pointer check in add_missing_dev(). Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b82032c |
|
17-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Seed device support Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
5f2cc086 |
|
07-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Avoid unplug storms during commit While doing a commit, btrfs makes sure all the metadata blocks were properly written to disk, calling wait_on_page_writeback for each page. This writeback happens after allowing another transaction to start, so it competes for the disk with other processes in the FS. If the page writeback bit is still set, each wait_on_page_writeback might trigger an unplug, even though the page might be waiting for checksumming to finish or might be waiting for the async work queue to submit the bio. This trades wait_on_page_writeback for waiting on the extent writeback bits. It won't trigger any unplugs and substantially improves performance in a number of workloads. This also changes the async bio submission to avoid requeueing if there is only one device. The requeue just wastes CPU time because there are no other devices to service. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
25179201 |
|
29-Oct-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
c8b97818 |
|
29-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add zlib compression support This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a62b9401 |
|
03-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: cast bio->bi_sector to a u64 before shifting On 32 bit machines without CONFIG_LBD, the bi_sector field is only 32 bits. Btrfs needs to cast it before shifting up, or we end up doing IO into the wrong place. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8c8bee1d |
|
29-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Wait for IO on the block device inodes of newly added devices btrfs-vol -a /dev/xxx will zero the first and last two MB of the device. The kernel code needs to wait for this IO to finish before it adds the device. btrfs metadata IO does not happen through the block device inode. A separate address space is used, allowing the zero filled buffer heads in the block device inode to be written to disk after FS metadata starts going down to the disk via the btrfs metadata inode. The end result is zero filled metadata blocks after adding new devices into the filesystem. The fix is a simple filemap_write_and_wait on the block device inode before actually inserting it into the pool of available devices. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1a40e23b |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: update space balancing code This patch updates the space balancing code to utilize the new backref format. Before, btrfs-vol -b would break any COW links on data blocks or metadata. This was slow and caused the amount of space used to explode if a large number of snapshots were present. The new code can keeps the sharing of all data extents and most of the tree blocks. To maintain the sharing of data extents, the space balance code uses a seperate inode hold data extent pointers, then updates the references to point to the new location. To maintain the sharing of tree blocks, the space balance code uses reloc trees to relocate tree blocks in reference counted roots. There is one reloc tree for each subvol, and all reloc trees share same root key objectid. Reloc trees are snapshots of the latest committed roots of subvols (root->commit_root). To relocate a tree block referenced by a subvol, there are two steps. COW the block through subvol's reloc tree, then update block pointer in the subvol to point to the new block. Since all reloc trees share same root key objectid, doing special handing for tree blocks owned by them is easy. Once a tree block has been COWed in one reloc tree, we can use the resulting new block directly when the same block is required to COW again through other reloc trees. In this way, relocated tree blocks are shared between reloc trees, so they are also shared between subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b1f55b0 |
|
24-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Remove Btrfs compat code for older kernels Btrfs had compatibility code for kernels back to 2.6.18. These have been removed, and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0f9dd46c |
|
23-Sep-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: free space accounting redo 1) replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is because most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible. 2) remove the extent_io_tree that tracked the block group cache from fs_info and replaced it with an rb-tree thats tracks block group cache via offset. also added a per space_info list that tracks the block group cache for the particular space so we can lookup related block groups easily. 3) cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps, first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again. 4) small fixes. there were some issues in volumes.c where we wouldn't allocate the rest of the disk. fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so everytime we did a data allocation we'd search through every block group that we have looking for free space. Now searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degredation and made things semi-normal. There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and make a significant performance gain. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
325cd4ba |
|
05-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: properly set blocksize when adding new device. --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a1b32a59 |
|
05-Sep-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add debugging checks to track down corrupted metadata Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9473f16c |
|
28-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Throttle for async bio submits higher up the chain The current code waits for the count of async bio submits to get below a given threshold if it is too high right after adding the latest bio to the work queue. This isn't optimal because the caller may have sequential adjacent bios pending they are waiting to send down the pipe. This changeset requires the caller to wait on the async bio count, and changes the async checksumming submits to wait for async bios any time they self throttle. The end result is much higher sequential throughput. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b64a2851 |
|
20-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Wait for async bio submissions to make some progress at queue time Before, the btrfs bdi congestion function was used to test for too many async bios. This keeps that check to throttle pdflush, but also adds a check while queuing bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0986fe9e |
|
15-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Count async bios separately from async checksum work items Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7d2b4daa |
|
05-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix the multi-bio code to save the original bio for completion The multi-bio code is responsible for duplicating blocks in raid1 and single spindle duplication. It has counters to make sure all of the locations for a given extent are properly written before io completion is returned to the higher layers. But, it didn't always complete the same bio it was given, sometimes a clone was completed instead. This lead to problems with the async work queues because they saved a pointer to the bio in a struct off bi_private. The fix is to remember the original bio and only complete that one. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
492bb6de |
|
31-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Hold a reference on bios during submit_bio, add some extra bio checks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bcc63abb |
|
30-Jul-2008 |
Yan <zheng.yan@oracle.com> |
Btrfs: implement memory reclaim for leaf reference cache The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7d9eb12c |
|
08-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add locking around volume management (device add/remove/balance) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a74a4b97 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the transaction work queue with kthreads This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a2135011 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the big fs_mutex with a collection of other locks Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1cc127b5 |
|
12-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a thread pool just for submit_bio If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8b712842 |
|
11-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add async worker threads for pre and post IO checksumming Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ef3e66b |
|
24-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allocator fix variety pack * Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
515dc322 |
|
16-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use kzalloc on the fs_devices allocation Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6af5ac3c |
|
16-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Handle transid == 0 while opening devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a0af469b |
|
13-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Fix btrfs_open_devices to deal with changes since the scan ioctls Devices can change after the scan ioctls are done, and btrfs_open_devices needs to be able to verify them as they are opened and used by the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dfe25020 |
|
13-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add mount -o degraded to allow mounts to continue with missing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1259ab75 |
|
12-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Handle write errors on raid1 and raid10 When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
323da79c |
|
09-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Chunk relocation fine tuning, and add a few printks to show progress Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c1c4d91c |
|
08-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Only open block devices once during mount -o subvol= btrfs_open_devices needed a check to see if the device was already open. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a061fc8d |
|
07-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for online device removal This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
006a58a2 |
|
02-May-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Compile warning fixup in volume.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2fff734f |
|
29-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Tune stripe selection for raid1 and raid10 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a236aed1 |
|
29-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Deal with failed writes in mirrored configurations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4235298e |
|
28-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Drop some verbose printks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ec44a35c |
|
28-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add balance ioctl to restripe the chunks Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
788f20eb |
|
28-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add new ioctl to add devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8f18cf13 |
|
25-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make the resizer work based on shrinking and growing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
84eed90f |
|
25-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add failure handling for read_sys_array Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e1c4b745 |
|
22-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Fix btrfs_get_extent and get_block corner cases, and disable O_DIRECT reads The generic O_DIRECT code assumes all the bios have the same bdev, which isn't true for multi-device btrfs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b3075717 |
|
22-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add a special device list for chunk allocations This allows other code that needs to walk every device in the FS to do so without locking against allocations. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3c12ac72 |
|
20-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Simplify device selection for mirrored reads Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f2d8d74d |
|
21-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Make an unplug function that doesn't unplug every spindle Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ad5bd91e |
|
21-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add 1MB to the min_free in alloc_chunk This properly reflects the first 1MB we skip at the start of the device Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a40a90a0 |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix chunk allocation when some devices don't have enough room for stripes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9b3f68b9 |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Calculate appropriate chunk sizes for both small and large filesystems Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7ae9c09d |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for labels in the super block Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a443755f |
|
18-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Check device uuids along with devids Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7bf3b490 |
|
17-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Avoid 64 bit div for RAID10 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3b951516 |
|
17-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Use the extent map cache to find the logical disk block during data retries The data read retry code needs to find the logical disk block before it can resubmit new bios. But, finding this block isn't allowed to take the fs_mutex because that will deadlock with a number of different callers. This changes the retry code to use the extent map cache instead, but that requires the extent map cache to have the extent we're looking for. This is a problem because btrfs_drop_extent_cache just drops the entire extent instead of the little tiny part it is invalidating. The bulk of the code in this patch changes btrfs_drop_extent_cache to invalidate only a portion of the extent cache, and changes btrfs_get_extent to deal with the results. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
321aecc6 |
|
16-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add RAID10 support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e17cade2 |
|
15-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add chunk uuids and update multi-device back references Block headers now store the chunk tree uuid Chunk items records the device uuid for each stripes Device extent items record better back refs to the chunk tree Block groups record better back refs to the chunk tree The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b248a415 |
|
14-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: A few updates for 2.6.18 and versions older than 2.6.25 This includes fixing a missing spinlock init call that caused oops on mount for most kernels other than 2.6.25. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
73f61b2a |
|
11-Apr-2008 |
Miguel <miguel.filipe@gmail.com> |
Btrfs: bio_endio support for linux 2.6.23 and older. bio_endio() changed prototype on linux 2.6.24, support older kernels using the older prototype. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f2984462 |
|
10-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Write out all super blocks on commit, and bring back proper barrier support Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f188591e |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Retry metadata reads in the face of checksum failures Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cea9e445 |
|
09-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Change btrfs_map_block to return a structure with mappings for all stripes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
611f0e00 |
|
03-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for duplicate blocks on a single spindle Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8790d502 |
|
03-Apr-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for mirroring across drives Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e58ca020 |
|
01-Apr-2008 |
Yan <yanzheng@21cn.com> |
Fix btrfs_fill_super to return -EINVAL when no FS found Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
593060d7 |
|
25-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Implement raid0 when multiple devices are present Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8a4b83cc |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for device scanning and detection ioctls Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
239b14b3 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Bring back mount -o ssd optimizations Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0d81ba5d |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Move device information into the super block so it can be scanned Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6324fbf3 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Dynamic chunk and block group allocation Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0b86a832 |
|
24-Mar-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add support for multiple devices per filesystem Signed-off-by: Chris Mason <chris.mason@oracle.com>
|