#
74e97958 |
|
21-Mar-2024 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations Create subvolume, create snapshot and delete subvolume all use btrfs_subvolume_reserve_metadata() to reserve metadata for the changes done to the parent subvolume's fs tree, which cannot be mediated in the normal way via start_transaction. When quota groups (squota or qgroups) are enabled, this reserves qgroup metadata of type PREALLOC. Once the operation is associated to a transaction, we convert PREALLOC to PERTRANS, which gets cleared in bulk at the end of the transaction. However, the error paths of these three operations were not implementing this lifecycle correctly. They unconditionally converted the PREALLOC to PERTRANS in a generic cleanup step regardless of errors or whether the operation was fully associated to a transaction or not. This resulted in error paths occasionally converting this rsv to PERTRANS without calling record_root_in_trans successfully, which meant that unless that root got recorded in the transaction by some other thread, the end of the transaction would not free that root's PERTRANS, leaking it. Ultimately, this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount for the leaked reservation. The fix is to ensure that every qgroup PREALLOC reservation observes the following properties: 1. any failure before record_root_in_trans is called successfully results in freeing the PREALLOC reservation. 2. after record_root_in_trans, we convert to PERTRANS, and now the transaction owns freeing the reservation. This patch enforces those properties on the three operations. Without it, generic/269 with squotas enabled at mkfs time would fail in ~5-10 runs on my system. With this patch, it ran successfully 1000 times in a row. Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86211eea |
|
26-Feb-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: validate btrfs_qgroup_inherit parameter [BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0478adff |
|
14-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: factor out validation of btrfs_ioctl_vol_args_v2::name The validation of vol args v2 name in snapshot and device remove ioctls is not done properly. A terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ab2b180 |
|
14-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: factor out validation of btrfs_ioctl_vol_args::name The validation of vol args name in several ioctls is not done properly. a terminating NUL is written to the end of the buffer unconditionally, assuming that this would be the last place in case the buffer is used completely. This does not communicate back the actual error (either an invalid or too long path). Factor out all such cases and use a helper to do the verification, simply look for NUL in the buffer. There's no expected practical change, the size of buffer is 4088, this is enough for most paths or names. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
41044b41 |
|
14-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: add helper to get fs_info from struct inode pointer Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
16865702 |
|
19-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: handle directory and dentry mismatch in btrfs_may_delete() The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete() to verify various conditions before deletion. There's a BUG_ON added before linux.git started, we can turn it to a proper error handling at least in our local helper. A mistmatch between directory and the deleted dentry is clearly invalid. This won't be probably ever hit due to the way how the parameters are set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper lookup_one(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b712e3b |
|
25-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused included headers With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4e00422e |
|
16-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: replace sb::s_blocksize by fs_info::sectorsize The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ae061cf |
|
23-Jan-2024 |
Christian Brauner <brauner@kernel.org> |
btrfs: port device access to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-19-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
e2b54eaf |
|
23-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix double free of anonymous device after snapshot creation failure When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0c309d66 |
|
10-Jan-2024 |
Boris Burkov <boris@bur.io> |
btrfs: forbid creating subvol qgroups Creating a qgroup 0/subvolid leads to various races and it isn't helpful, because you can't specify a subvol id when creating a subvol, so you can't be sure it will be the right one. Any requirements on the automatic subvol can be gratified by using a higher level qgroup and the inheritance parameters of subvol creation. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
173431b2 |
|
09-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args Add extra sanity check for btrfs_ioctl_defrag_range_args::flags. This is not really to enhance fuzzing tests, but as a preparation for future expansion on btrfs_ioctl_defrag_range_args. In the future we're going to add new members, allowing more fine tuning for btrfs defrag. Without the -ENONOTSUPP error, there would be no way to detect if the kernel supports those new defrag features. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7081929a |
|
04-Jan-2024 |
Omar Sandoval <osandov@fb.com> |
btrfs: don't abort filesystem when attempting to snapshot deleted subvolume If the source file descriptor to the snapshot ioctl refers to a deleted subvolume, we get the following abort: BTRFS: Transaction aborted (error -2) WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs] Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs] RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027 RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840 RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998 R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80 FS: 00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0 Call Trace: <TASK> ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? __warn+0x81/0x130 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? report_bug+0x171/0x1a0 ? handle_bug+0x3a/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? create_pending_snapshot+0x1040/0x1190 [btrfs] ? create_pending_snapshot+0x1040/0x1190 [btrfs] create_pending_snapshots+0x92/0xc0 [btrfs] btrfs_commit_transaction+0x66b/0xf40 [btrfs] btrfs_mksubvol+0x301/0x4d0 [btrfs] btrfs_mksnapshot+0x80/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs] btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs] btrfs_ioctl+0x8a6/0x2650 [btrfs] ? kmem_cache_free+0x22/0x340 ? do_sys_openat2+0x97/0xe0 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x46/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe20abe83af RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry BTRFS info (device vdc: state EA): forced readonly BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction. BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry This happens because create_pending_snapshot() initializes the new root item as a copy of the source root item. This includes the refs field, which is 0 for a deleted subvolume. The call to btrfs_insert_root() therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then finds the root and returns -ENOENT if refs == 0, which causes create_pending_snapshot() to abort. Fix it by checking the source root's refs before attempting the snapshot, but after locking subvol_sem to avoid racing with deletion. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f4d8ad8 |
|
22-Nov-2023 |
Amir Goldstein <amir73il@gmail.com> |
btrfs: move file_start_write() to after permission hook In vfs code, file_start_write() is usually called after the permission hook in rw_verify_area(). btrfs_ioctl_encoded_write() in an exception to this rule. Move file_start_write() to after the rw_verify_area() check in encoded write to make the permission hook "start-write-safe". This is needed for fanotify "pre content" events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
a8892fd7 |
|
15-Dec-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not allow non subvolume root targets for snapshot Our btrfs subvolume snapshot <source> <destination> utility enforces that <source> is the root of the subvolume, however this isn't enforced in the kernel. Update the kernel to also enforce this limitation to avoid problems with other users of this ioctl that don't have the appropriate checks in place. Reported-by: Martin Michaelis <code@mgjm.de> CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5de0434b |
|
14-Nov-2023 |
David Sterba <dsterba@suse.com> |
btrfs: fix 64bit compat send ioctl arguments not initializing version member When the send protocol versioning was added in 5.16 e77fbf990316 ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was not updated (added by 2351f431f727 ("btrfs: fix send ioctl on 32bit with 64bit kernel")), missing the version struct member. The compat code is probably rarely used, nobody reported any bugs. Found by tool https://github.com/jirislaby/clang-struct . Fixes: e77fbf990316 ("btrfs: send: prepare for v2 protocol") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dec96fc2 |
|
13-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: use u64 for buffer sizes in the tree search ioctls In the tree search v2 ioctl we use the type size_t, which is an unsigned long, to track the buffer size in the local variable 'buf_size'. An unsigned long is 32 bits wide on a 32 bits architecture. The buffer size defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later try to copy the local variable 'buf_size' to the argument struct, when the search returns -EOVERFLOW, we copy only 32 bits which will be a problem on big endian systems. Fix this by using a u64 type for the buffer sizes, not only at btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain so that we can use the u64 at btrfs_ioctl_tree_search_v2(). Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86ec15d0 |
|
27-Sep-2023 |
Jan Kara <jack@suse.cz> |
btrfs: Convert to bdev_open_by_path() Convert btrfs to use bdev_open_by_path() and pass the handle around. We also drop the holder from struct btrfs_device as it is now not needed anymore. CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
ac6ea6a9 |
|
04-Oct-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: disable the device add feature for temp-fsid The device addition operation will transform the cloned temp-fsid mounted device into a multi-device filesystem. Therefore, it is marked as unsupported. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0124855f |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing last_trans_committed Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4a4f8fe2 |
|
04-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add and use helpers for reading and writing fs_info->generation Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8b9d0322 |
|
22-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove redundant root argument from btrfs_update_inode() The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
60ea105a |
|
21-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: track metadata relocation COW with simple quota Relocation COWs metadata blocks in two cases for the reloc root: - copying the subvolume root item when creating the reloc root - copying a btree node when there is a COW during relocation In both cases, the resulting btree node hits an abnormal code path with respect to the owner field in its btrfs_header. It first creates the root item for the new objectid, which populates the reloc root id, and it at this point that delayed refs are created. Later, it fully copies the old node into the new node (including the original owner field) which overwrites it. This results in a simple quotas mismatch where we run the delayed ref for the reloc root which has no simple quota effect (reloc root is not an fstree) but when we ultimately delete the node, the owner is the real original fstree and we do free the space. To work around this without tampering with the behavior of relocation, add a parameter to btrfs_add_tree_block that lets the relocation code path specify a different owning root than the "operating" root (in this case, owning root is the real root and the operating root is the reloc root). These can naturally be plumbed into delayed refs that have the same concept. Note that this is a double count in some sense, but a relatively natural one, as there are really two extents, and the old one will be deleted soon. This is consistent with how data relocation extents are accounted by simple quotas. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5343cd93 |
|
28-Mar-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: simple quota auto hierarchy for nested subvolumes Consider the following sequence: - enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/ With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100. Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in. In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes. I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
182940f4 |
|
16-May-2023 |
Boris Burkov <boris@bur.io> |
btrfs: qgroup: add new quota mode for simple quotas Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50564b65 |
|
12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9147b9de |
|
26-Sep-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c Jens reported the following warnings from -Wmaybe-uninitialized recent Linus' branch. In file included from ./include/asm-generic/rwonce.h:26, from ./arch/arm64/include/asm/rwonce.h:71, from ./include/linux/compiler.h:246, from ./include/linux/export.h:5, from ./include/linux/linkage.h:7, from ./include/linux/kernel.h:17, from fs/btrfs/ioctl.c:6: In function ‘instrument_copy_from_user_before’, inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3, inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7, inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6, inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10: ./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used uninitialized [-Wmaybe-uninitialized] 38 | #define kasan_check_write __kasan_check_write ./include/linux/instrumented.h:129:9: note: in expansion of macro ‘kasan_check_write’ 129 | kasan_check_write(to, n); | ^~~~~~~~~~~~~~~~~ ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’: ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const volatile void *’ to ‘__kasan_check_write’ declared here 20 | bool __kasan_check_write(const volatile void *p, unsigned int size); | ^~~~~~~~~~~~~~~~~~~ fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here 2981 | struct btrfs_ioctl_space_args space_args; | ^~~~~~~~~~ In function ‘instrument_copy_from_user_before’, inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3, inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7, inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9, inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10: ./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used uninitialized [-Wmaybe-uninitialized] 38 | #define kasan_check_write __kasan_check_write ./include/linux/instrumented.h:129:9: note: in expansion of macro ‘kasan_check_write’ 129 | kasan_check_write(to, n); | ^~~~~~~~~~~~~~~~~ ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’: ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const volatile void *’ to ‘__kasan_check_write’ declared here 20 | bool __kasan_check_write(const volatile void *p, unsigned int size); | ^~~~~~~~~~~~~~~~~~~ fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here 4341 | struct btrfs_ioctl_send_args_32 args32; | ^~~~~~ This was due to his config options and having KASAN turned on, which adds some extra checks around copy_from_user(), which then triggered the -Wmaybe-uninitialized checker for these cases. Fix the warnings by initializing the different structs we're copying into. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee34a82e |
|
26-Aug-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: release path before inode lookup during the ino lookup ioctl During the ino lookup ioctl we can end up calling btrfs_iget() to get an inode reference while we are holding on a root's btree. If btrfs_iget() needs to lookup the inode from the root's btree, because it's not currently loaded in memory, then it will need to lock another or the same path in the same root btree. This may result in a deadlock and trigger the following lockdep splat: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted ------------------------------------------------------ syz-executor277/5012 is trying to acquire lock: ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 but task is already holding lock: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302 btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955 btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline] btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338 btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline] open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494 btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154 btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 fc_mount fs/namespace.c:1112 [inline] vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142 btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 do_new_mount+0x28f/0xae0 fs/namespace.c:3335 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-tree-01){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(btrfs-tree-00); lock(btrfs-tree-01); lock(btrfs-tree-00); rlock(btrfs-tree-01); *** DEADLOCK *** 1 lock held by syz-executor277/5012: #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 stack backtrace: CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0bec94ea39 Fix this simply by releasing the path before calling btrfs_iget() as at point we don't need the path anymore. Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/ Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2a9462de |
|
05-Jul-2023 |
Jeff Layton <jlayton@kernel.org> |
btrfs: convert to ctime accessor functions In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-27-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
edc72881 |
|
11-May-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: trigger orphan inode cleanup during START_SYNC ioctl There is an internal error report that scrub found an error in an orphan inode's data. However there are very limited ways to cleanup such orphan inodes: - btrfs_start_pre_rw_mount() This happens at either mount, or RO->RW switch. This is not a viable solution for root fs which may not be unmounted or RO mounted. Furthermore this doesn't cover every subvolume, it only covers the currently cached subvolumes. - btrfs_lookup_dentry() This happens when we first lookup the subvolume dentry. But dentry can be cached thus it's not ensured to be triggered every time. - create_snapshot() This only happens for the created snapshot, not the source one. This means if we didn't trigger orphan items cleanup, there is really no other way to manually trigger it. Add this step to the START_SYNC ioctl. This is a slight change in the semantics of the ioctl but as sync can be potentially slow and is usually paired with WAIT_SYNC ioctl. The errors are not handled because the main point of the ioctl is the async commit, orphan cleanup is a side effect. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
12df6a62 |
|
02-May-2023 |
Tom Rix <trix@redhat.com> |
btrfs: simplify transid initialization in btrfs_ioctl_wait_sync A small code simplification, move the default value of transid to its initialization and remove the else-statement. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b53e51a |
|
11-Apr-2023 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: don't commit transaction for every subvol create Recently a Meta-internal workload encountered subvolume creation taking up to 2s each, significantly slower than directory creation. As they were hoping to be able to use subvolumes instead of directories, and were looking to create hundreds, this was a significant issue. After Josef investigated, it turned out to be due to the transaction commit currently performed at the end of subvolume creation. This change improves the workload by not doing transaction commit for every subvolume creation, and merely requiring a transaction commit on fsync. In the worst case, of doing a subvolume create and fsync in a loop, this should require an equal amount of time to the current scheme; and in the best case, the internal workload creating hundreds of subvolumes before fsyncing is greatly improved. While it would be nice to be able to use the log tree and use the normal fsync path, log tree replay can't deal with new subvolume inodes presently. It's possible that there's some reason that the transaction commit is necessary for correctness during subvolume creation; however, git logs indicate that the commit dates back to the beginning of subvolume creation, and there are no notes on why it would be necessary. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2736e8ee |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: use the holder as indication for exclusive opens The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
ac868bc9 |
|
13-Apr-2023 |
xiaoshoukui <xiaoshoukui@gmail.com> |
btrfs: fix assertion of exclop condition when starting balance Balance as exclusive state is compatible with paused balance and device add, which makes some things more complicated. The assertion of valid states when starting from paused balance needs to take into account two more states, the combinations can be hit when there are several threads racing to start balance and device add. This won't typically happen when the commands are started from command line. Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE. Concurrently adding multiple devices to the same mount point and btrfs_exclop_finish executed finishes before assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_NONE state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD, in fs/btrfs/ioctl.c:456 Call Trace: <TASK> btrfs_exclop_balance+0x13c/0x310 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED. Concurrently adding multiple devices to the same mount point and btrfs_exclop_balance executed finish before the latter thread execute assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, fs/btrfs/ioctl.c:458 Call Trace: <TASK> btrfs_exclop_balance+0x240/0x410 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd An example of the failed assertion is below, which shows that the paused balance is also needed to be checked. root@syzkaller:/home/xsk# ./repro Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0 Failed to add device /dev/vda, errno 14 [ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1 Failed to add device /dev/vda, errno 14 [ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458 [ 416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7 [ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 416.652502][ T7984] PKRU: 55555554 [ 416.652888][ T7984] Call Trace: [ 416.653241][ T7984] <TASK> [ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410 [ 416.654036][ T7984] ? memdup_user+0xab/0xc0 [ 416.654465][ T7984] ? PTR_ERR+0x17/0x20 [ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320 [ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0 [ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80 [ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210 [ 416.656874][ T7984] do_syscall_64+0x3c/0xb0 [ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 416.657922][ T7984] RIP: 0033:0x4546af [ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af [ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003 [ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f [ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640 [ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000 [ 416.664703][ T7984] </TASK> [ 416.665040][ T7984] Modules linked in: [ 416.665590][ T7984] ---[ end trace 0000000000000000 ]--- [ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
604e6681 |
|
05-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: scrub: reject unsupported scrub flags Since the introduction of scrub interface, the only flag that we support is BTRFS_SCRUB_READONLY. Thus there is no sanity checks, if there are some undefined flags passed in, we just ignore them. This is problematic if we want to introduce new scrub flags, as we have no way to determine if such flags are supported. Address the problem by introducing a check for the flags, and if unsupported flags are set, return -EOPNOTSUPP to inform the user space. This check should be backported for all supported kernels before any new scrub flags are introduced. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2f1a6be1 |
|
22-Mar-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between quota disable and quota assign ioctls The quota assign ioctl can currently run in parallel with a quota disable ioctl call. The assign ioctl uses the quota root, while the disable ioctl frees that root, and therefore we can have a use-after-free triggered in the assign ioctl, leading to a trace like the following when KASAN is enabled: [672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0 [672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736 [672.724][T736] [672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37 [672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [672.727][T736] Call Trace: [672.728][T736] <TASK> [672.728][T736] dump_stack_lvl+0xd9/0x150 [672.725][T736] print_report+0xc1/0x5e0 [672.720][T736] ? __virt_addr_valid+0x61/0x2e0 [672.727][T736] ? __phys_addr+0xc9/0x150 [672.725][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.722][T736] kasan_report+0xc0/0xf0 [672.729][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.724][T736] btrfs_search_slot+0x2962/0x2db0 [672.723][T736] ? fs_reclaim_acquire+0xba/0x160 [672.722][T736] ? split_leaf+0x13d0/0x13d0 [672.726][T736] ? rcu_is_watching+0x12/0xb0 [672.723][T736] ? kmem_cache_alloc+0x338/0x3c0 [672.722][T736] update_qgroup_status_item+0xf7/0x320 [672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0 [672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0 [672.730][T736] ? spin_bug+0x1d0/0x1d0 [672.737][T736] btrfs_run_qgroups+0x5de/0x840 [672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70 [672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00 [672.738][T736] btrfs_ioctl+0x3d58/0x5d80 [672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550 [672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0 [672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90 [672.734][T736] ? do_vfs_ioctl+0x132/0x1660 [672.730][T736] ? vfs_fileattr_set+0xc40/0xc40 [672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50 [672.732][T736] ? sigprocmask+0xf2/0x340 [672.737][T736] ? __fget_files+0x26a/0x480 [672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10 [672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.736][T736] __x64_sys_ioctl+0x198/0x210 [672.736][T736] do_syscall_64+0x39/0xb0 [672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.739][T736] RIP: 0033:0x4556ad [672.742][T736] </TASK> [672.743][T736] [672.748][T736] Allocated by task 27677: [672.743][T736] kasan_save_stack+0x22/0x40 [672.741][T736] kasan_set_track+0x25/0x30 [672.741][T736] __kasan_kmalloc+0xa4/0xb0 [672.749][T736] btrfs_alloc_root+0x48/0x90 [672.746][T736] btrfs_create_tree+0x146/0xa20 [672.744][T736] btrfs_quota_enable+0x461/0x1d20 [672.743][T736] btrfs_ioctl+0x4a1c/0x5d80 [672.747][T736] __x64_sys_ioctl+0x198/0x210 [672.749][T736] do_syscall_64+0x39/0xb0 [672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.756][T736] [672.757][T736] Freed by task 27677: [672.759][T736] kasan_save_stack+0x22/0x40 [672.759][T736] kasan_set_track+0x25/0x30 [672.756][T736] kasan_save_free_info+0x2e/0x50 [672.751][T736] ____kasan_slab_free+0x162/0x1c0 [672.758][T736] slab_free_freelist_hook+0x89/0x1c0 [672.752][T736] __kmem_cache_free+0xaf/0x2e0 [672.752][T736] btrfs_put_root+0x1ff/0x2b0 [672.759][T736] btrfs_quota_disable+0x80a/0xbc0 [672.752][T736] btrfs_ioctl+0x3e5f/0x5d80 [672.756][T736] __x64_sys_ioctl+0x198/0x210 [672.753][T736] do_syscall_64+0x39/0xb0 [672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.769][T736] [672.768][T736] The buggy address belongs to the object at ffff888022ec0000 [672.768][T736] which belongs to the cache kmalloc-4k of size 4096 [672.769][T736] The buggy address is located 520 bytes inside of [672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000) [672.760][T736] [672.764][T736] The buggy address belongs to the physical page: [672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0 [672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) [672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002 [672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 [672.771][T736] page dumped because: kasan: bad access detected [672.778][T736] page_owner tracks the page as allocated [672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88 [672.779][T736] get_page_from_freelist+0x119c/0x2d50 [672.779][T736] __alloc_pages+0x1cb/0x4a0 [672.776][T736] alloc_pages+0x1aa/0x270 [672.773][T736] allocate_slab+0x260/0x390 [672.771][T736] ___slab_alloc+0xa9a/0x13e0 [672.778][T736] __slab_alloc.constprop.0+0x56/0xb0 [672.771][T736] __kmem_cache_alloc_node+0x136/0x320 [672.789][T736] __kmalloc+0x4e/0x1a0 [672.783][T736] tomoyo_realpath_from_path+0xc3/0x600 [672.781][T736] tomoyo_path_perm+0x22f/0x420 [672.782][T736] tomoyo_path_unlink+0x92/0xd0 [672.780][T736] security_path_unlink+0xdb/0x150 [672.788][T736] do_unlinkat+0x377/0x680 [672.788][T736] __x64_sys_unlink+0xca/0x110 [672.789][T736] do_syscall_64+0x39/0xb0 [672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.784][T736] page last free stack trace: [672.787][T736] free_pcp_prepare+0x4e5/0x920 [672.787][T736] free_unref_page+0x1d/0x4e0 [672.784][T736] __unfreeze_partials+0x17c/0x1a0 [672.797][T736] qlist_free_all+0x6a/0x180 [672.796][T736] kasan_quarantine_reduce+0x189/0x1d0 [672.797][T736] __kasan_slab_alloc+0x64/0x90 [672.793][T736] kmem_cache_alloc+0x17c/0x3c0 [672.799][T736] getname_flags.part.0+0x50/0x4e0 [672.799][T736] getname_flags+0x9e/0xe0 [672.792][T736] vfs_fstatat+0x77/0xb0 [672.791][T736] __do_sys_newlstat+0x84/0x100 [672.798][T736] do_syscall_64+0x39/0xb0 [672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.790][T736] [672.791][T736] Memory state around the buggy address: [672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ^ [672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex before calling btrfs_run_qgroups(), which is what all qgroup ioctls should call. Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2943868a |
|
11-Feb-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: ioctl: return device fsid from DEV_INFO ioctl Currently user space utilizes dev info ioctl to grab the info of a certain devid, this includes its device uuid. But the returned info is not enough to determine if a device is a seed. Commit a26d60dedf9a ("btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device") exports the same value in sysfs so this is for parity with ioctl. Add a new member, fsid, into btrfs_ioctl_dev_info_args, and populate the member with fsid value. This should not cause any compatibility problem, following the combinations: - Old user space, old kernel - Old user space, new kernel User space tool won't even check the new member. - New user space, old kernel The kernel won't touch the new member, and user space tool should zero out its argument, thus the new member is all zero. User space tool can then know the kernel doesn't support this fsid reporting, and falls back to whatever they can. - New user space, new kernel Go as planned. Would find the fsid member is no longer zero, and trust its value. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
190a8339 |
|
26-Jan-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirty btrfs_clean_tree_block is a misnomer, it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it more clear it belongs with it's setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ed25dab3 |
|
26-Jan-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add trans argument to btrfs_clean_tree_block We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9452e93e |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port privilege checking helpers to mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
01beba79 |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port inode_owner_or_capable() to mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
f2d40141 |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port inode_init_owner() to mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
4609e1f1 |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port ->permission() to pass mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
8782a9ae |
|
12-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port ->fileattr_set() to pass mnt_idmap Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
#
63d5429f |
|
19-Nov-2022 |
Artem Chernyshev <artem.chernyshev@red-soft.ru> |
btrfs: replace strncpy() with strscpy() Using strncpy() on NUL-terminated strings are deprecated. To avoid possible forming of non-terminated string strscpy() should be used. Found by Linux Verification Center (linuxtesting.org) with SVACE. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cb3e217b |
|
12-Nov-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: use btrfs_dev_name() helper to handle missing devices better [BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c4f91e2 |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_delete_subvolume The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5d4d75b |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_inode_unlock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
29b6352b |
|
26-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: pass btrfs_inode to btrfs_inode_lock The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c03b2207 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move super prototypes into super.h Move these out of ctree.h into super.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2fc6822c |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move scrub prototypes into scrub.h Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
af142b6f |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move file prototypes to file.h Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7572dec8 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move ioctl prototypes into ioctl.h Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7a03b52 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move uuid tree prototypes to uuid-tree.h Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f2b39277 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move dir-item prototypes into dir-item.h Move these prototypes out of ctree.h and into their own header file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
59b818e0 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move defrag related prototypes to their own header Now that the defrag code is all in one file, create a defrag.h and move all the defrag related prototypes and helper out of ctree.h and into defrag.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a6a01ca6 |
|
26-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move the file defrag code into defrag.c This is the other big portion of defrag code that has existed in ioctl.c. Move it to its new home in defrag.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43dd529a |
|
27-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: update function comments Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45c40c8f |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move root tree prototypes to their own header Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a0231804 |
|
24-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6db75318 |
|
19-Oct-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: use struct fscrypt_str instead of struct qstr While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e43eec81 |
|
19-Oct-2022 |
Sweet Tea Dorminy <sweettea-kernel@dorminy.me> |
btrfs: use struct qstr instead of name and namelen pairs Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
07e81dc9 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7f13d42 |
|
19-Oct-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b307f06d |
|
18-Oct-2022 |
David Sterba <dsterba@suse.com> |
btrfs: simplify generation check in btrfs_get_dentry Callers that pass non-zero generation always want to perform the generation check, we can simply encode that in one parameter and drop check_generation. Add function documentation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de4eda9d |
|
15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
013c1c55 |
|
09-Nov-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: free btrfs_path before copying subvol info to userspace btrfs_ioctl_get_subvol_info() frees the search path after the userspace copy from the temp buffer @subvol_info. This can lead to a lock splat warning. Fix this by freeing the path before we copy it to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8cf96b40 |
|
09-Nov-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: free btrfs_path before copying fspath to userspace btrfs_ioctl_ino_to_path() frees the search path after the userspace copy from the temp buffer @ipath->fspath. Which potentially can lead to a lock splat warning. Fix this by freeing the path before we copy it to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
418ffb9e |
|
09-Nov-2022 |
Anand Jain <anand.jain@oracle.com> |
btrfs: free btrfs_path before copying inodes to userspace btrfs_ioctl_logical_to_ino() frees the search path after the userspace copy from the temp buffer @inodes. Which potentially can lead to a lock splat. Fix this by freeing the path before we copy @inodes to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b740d806 |
|
07-Nov-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: free btrfs_path before copying root refs to userspace Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: syzbot+4ef9e52e464c6ff47d9d@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bd015294 |
|
09-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
570eb97b |
|
09-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: unify the lock/unlock extent variants We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dbbf4992 |
|
09-Sep-2022 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove the wake argument from clear_extent_bits This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7601566 |
|
08-Jul-2022 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Use fs_info->max_extent_size also in get_extent_max_capacity() for the completeness. This is only used for defrag and not really necessary to fix the metadata reservation size. But, it still suppresses unnecessary defrag operations. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e3059ec0 |
|
06-Jun-2022 |
David Sterba <dsterba@suse.com> |
btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino There's only one function we pass to iterate_inodes_from_logical as iterator, so we can drop the indirection and call it directly, after moving the function to backref.c Signed-off-by: David Sterba <dsterba@suse.com>
|
#
099aa972 |
|
05-May-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance This eliminates 2 labels and makes the code generally more streamlined. Also rename the 'out_bargs' label to 'out_unlock' since bargs is going to be freed under the 'out' label. This also fixes a memory leak since bargs wasn't correctly freed in one of the condition which are now moved in btrfs_try_lock_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7fb10ed8 |
|
03-May-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: introduce btrfs_try_lock_balance This function contains the factored out locking sequence of btrfs_ioctl_balance. Having this piece of code separate helps to simplify btrfs_ioctl_balance which has too complicated. This will be used in the next patch to streamline the logic in btrfs_ioctl_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8101a0c |
|
09-May-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: allow defrag to convert inline extents to regular extents Btrfs defaults to max_inline=2K to make small writes inlined into metadata. The default value is always a win, as even DUP/RAID1/RAID10 doubles the metadata usage, it should still cause less physical space used compared to a 4K regular extents. But since the introduction of RAID1C3 and RAID1C4 it's no longer the case, users may find inlined extents causing too much space wasted, and want to convert those inlined extents back to regular extents. Unfortunately defrag will unconditionally skip all inline extents, no matter if the user is trying to converting them back to regular extents. So this patch will add a small exception for defrag_collect_targets() to allow defragging inline extents, if and only if the inlined extents are larger than max_inline, allowing users to convert them to regular ones. This also allows us to defrag extents like the following: item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69 generation 7 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 1 (zlib) item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53 generation 7 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 16384 ram 16384 extent compression 1 (zlib) Previously we're unable to do any defrag, since the first extent is inlined, and the second one has no extent to merge. Now we can defrag it to just one single extent, saving 48 bytes metadata space. item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13635584 nr 4096 extent data offset 0 nr 20480 ram 20480 extent compression 1 (zlib) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d031dc4 |
|
31-Mar-2022 |
Yu Zhe <yuzhe@nfschina.com> |
btrfs: remove unnecessary type casts Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8645462 |
|
29-Mar-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: simplify code flow in btrfs_ioctl_balance Move code in btrfs_ioctl_balance to simplify its flow. This is possible thanks to the removal of balance v1 ioctl and ensuring 'arg' argument is always present. First move the code duplicating the userspace arg to the kernel 'barg'. This makes the out_unlock label redundant. Secondly, check the validity of bargs::flags before copying to the dynamically allocated 'bctl'. This removes the need for the out_bctl label. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39864601 |
|
29-Mar-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove checks for arg argument in btrfs_ioctl_balance With the removal of balance v1 ioctl the 'arg' argument is guaranteed to be present so simply remove all conditional code which checks for its presence. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
caae78e0 |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: move common inode creation code into btrfs_create_new_inode() All of our inode creation code paths duplicate the calls to btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation additionally duplicates property inheritance and the call to btrfs_set_inode_index(). Fix this by moving the common code into btrfs_create_new_inode(). This accomplishes a few things at once: 1. It reduces code duplication. 2. It allows us to set up the inode completely before inserting the inode item, removing calls to btrfs_update_inode(). 3. It fixes a leak of an inode on disk in some error cases. For example, in btrfs_create(), if btrfs_new_inode() succeeds, then we have inserted an inode item and its inode ref. However, if something after that fails (e.g., btrfs_init_inode_security()), then we end the transaction and then decrement the link count on the inode. If the transaction is committed and the system crashes before the failed inode is deleted, then we leak that inode on disk. Instead, this refactoring aborts the transaction when we can't recover more gracefully. 4. It exposes various ways that subvolume creation diverges from mkdir in terms of inheriting flags, properties, permissions, and POSIX ACLs, a lot of which appears to be accidental. This patch explicitly does _not_ change the existing non-standard behavior, but it makes those differences more clear in the code and documents them so that we can discuss whether they should be changed. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3538d68dbd |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: reserve correct number of items for inode creation The various inode creation code paths do not account for the compression property, POSIX ACLs, or the parent inode item when starting a transaction. Fix it by refactoring all of these code paths to use a new function, btrfs_new_inode_prepare(), which computes the correct number of items. To do so, it needs to know whether POSIX ACLs will be created, so move the ACL creation into that function. To reduce the number of arguments that need to be passed around for inode creation, define struct btrfs_new_inode_args containing all of the relevant information. btrfs_new_inode_prepare() will also be a good place to set up the fscrypt context and encrypted filename in the future. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a1fd0c35 |
|
14-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: allocate inode outside of btrfs_new_inode() Instead of calling new_inode() and inode_init_owner() inside of btrfs_new_inode(), do it in the callers. This allows us to pass in just the inode instead of the mnt_userns and mode and removes the need for memalloc_nofs_{save,restores}() since we do it before starting a transaction. In create_subvol(), it also means we no longer have to look up the inode again to instantiate it. This also paves the way for some more cleanups in later patches. This also removes the comments about Smack checking i_op, which are no longer true since commit 5d6c31910bc0 ("xattr: Add __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags & IOP_XATTR, which is set based on sb->s_xattr. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70dc55f4 |
|
09-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: remove redundant name and name_len parameters to create_subvol The passed dentry already contains the name. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2256e901 |
|
09-Mar-2022 |
Omar Sandoval <osandov@fb.com> |
btrfs: fix anon_dev leak in create_subvol() When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or btrfs_insert_root() fail in create_subvol(), we return without freeing anon_dev. Reorganize the error handling in create_subvol() to fix this. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fb12489b |
|
29-Apr-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: Convert btrfs to read_folio This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
#
18788e34 |
|
23-Apr-2022 |
Catalin Marinas <catalin.marinas@arm.com> |
btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl") addressed a lockdep warning by pre-faulting the user pages and attempting the copy_to_user_nofault() in an infinite loop. On architectures like arm64 with MTE, an access may fault within a page at a location different from what fault_in_writeable() probed. Since the sk_offset is rewound to the previous struct btrfs_ioctl_search_header boundary, there is no guaranteed forward progress and search_ioctl() may live-lock. Use fault_in_subpage_writeable() instead of fault_in_writeable() to ensure the permission is checked at the right granularity (smaller than PAGE_SIZE). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Fixes: a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220423100751.1870771-4-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
#
7b47ef52 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: add a bdev_discard_granularity helper Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
70200574 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
d03ae0d3 |
|
29-Mar-2022 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove support of balance v1 ioctl It was scheduled for removal in kernel v5.18 commit 6c405b24097c ("btrfs: deprecate BTRFS_IOC_BALANCE ioctl") thus its time has come. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
75a36a7d |
|
15-Mar-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: avoid defragging extents whose next extents are not targets [BUG] There is a report that autodefrag is defragging single sector, which is completely waste of IO, and no help for defragging: btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096 [CAUSE] In defrag_collect_targets(), we check if the current range (A) can be merged with next one (B). If mergeable, we will add range A into target for defrag. However there is a catch for autodefrag, when checking mergeability against range B, we intentionally pass 0 as @newer_than, hoping to get a higher chance to merge with the next extent. But in the next iteration, range B will looked up by defrag_lookup_extent(), with non-zero @newer_than. And if range B is not really newer, it will rejected directly, causing only range A being defragged, while we expect to defrag both range A and B. [FIX] Since the root cause is the difference in check condition of defrag_check_next_extent() and defrag_collect_targets(), we fix it by: 1. Pass @newer_than to defrag_check_next_extent() 2. Pass @extent_thresh to defrag_check_next_extent() This makes the check between defrag_collect_targets() and defrag_check_next_extent() more consistent. While there is still some minor difference, the remaining checks are focus on runtime flags like writeback/delalloc, which are mostly transient and safe to be checked only in defrag_collect_targets(). Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856 CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c0c7269 |
|
13-Aug-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: add BTRFS_IOC_ENCODED_WRITE The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1881fba8 |
|
09-Oct-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: add BTRFS_IOC_ENCODED_READ ioctl There are 4 main cases: 1. Inline extents: we copy the data straight out of the extent buffer. 2. Hole/preallocated extents: we fill in zeroes. 3. Regular, uncompressed extents: we read the sectors we need directly from disk. 4. Regular, compressed extents: we read the entire compressed extent from disk and indicate what subset of the decompressed extent is in the file. This initial implementation simplifies a few things that can be improved in the future: - Cases 1, 3, and 4 allocate temporary memory to read into before copying out to userspace. - We don't do read repair, because it turns out that read repair is currently broken for compressed data. - We hold the inode lock during the operation. Note that we don't need to hold the mmap lock. We may race with btrfs_page_mkwrite() and read the old data from before the page was dirtied: btrfs_page_mkwrite btrfs_encoded_read --------------------------------------------------- (enter) (enter) btrfs_wait_ordered_range lock_extent_bits btrfs_page_set_dirty unlock_extent_cached (exit) lock_extent_bits read extent (dirty page hasn't been flushed, so this is the old data) unlock_extent_cached (exit) we read the old data from before the page was dirtied. But, that's true even if we were to hold the mmap lock: btrfs_page_mkwrite btrfs_encoded_read ------------------------------------------------------------------- (enter) (enter) btrfs_inode_lock(BTRFS_ILOCK_MMAP) down_read(i_mmap_lock) (blocked) btrfs_wait_ordered_range lock_extent_bits read extent (page hasn't been dirtied, so this is the old data) unlock_extent_cached btrfs_inode_unlock(BTRFS_ILOCK_MMAP) down_read(i_mmap_lock) returns lock_extent_bits btrfs_page_set_dirty unlock_extent_cached In other words, this is inherently racy, so it's fine that we return the old data in this tiny window. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a55e65b8 |
|
01-Feb-2022 |
David Sterba <dsterba@suse.com> |
btrfs: replace BUILD_BUG_ON by static_assert The static_assert introduced in 6bab69c65013 ("build_bug.h: add wrapper for _Static_assert") has been supported by compilers for a long time (gcc 4.6, clang 3.0) and can be used in header files. We don't need to put BUILD_BUG_ON to random functions but rather keep it next to the definition. The exception here is the UAPI header btrfs_tree.h that could be potentially included by userspace code and the static assert is not defined (nor used in any other header). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
813febdb |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: disable snapshot creation/deletion for extent tree v2 When we stop tracking metadata blocks all of snapshotting will break, so disable it until I add the snapshot root and drop tree support. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da32c6d5 |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: disable scrub for extent-tree-v2 Scrub depends on extent references for every block, and with extent tree v2 we won't have that, so disable scrub until we can add back the proper code to handle extent-tree-v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
914a519b |
|
15-Dec-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: disable device manipulation ioctl's EXTENT_TREE_V2 Device add, remove, and replace all require balance, which doesn't work right now on extent tree v2, so disable these for now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9ad12305 |
|
15-Jan-2022 |
Sahil Kang <sahil.kang@asilaycomputing.com> |
btrfs: reuse existing inode from btrfs_ioctl btrfs_ioctl extracts inode from file so we can pass that into the callbacks. Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dc408ccd |
|
05-Jan-2022 |
Sahil Kang <sahil.kang@asilaycomputing.com> |
btrfs: reuse existing pointers from btrfs_ioctl btrfs_ioctl already contains pointers to the inode and btrfs_root structs, so we can pass them into the subfunctions instead of the toplevel struct file. Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
199257a7 |
|
10-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: don't use merged extent map for their generation check For extent maps, if they are not compressed extents and are adjacent by logical addresses and file offsets, they can be merged into one larger extent map. Such merged extent map will have the higher generation of all the original ones. But this brings a problem for autodefrag, as it relies on accurate extent_map::generation to determine if one extent should be defragged. For merged extent maps, their higher generation can mark some older extents to be defragged while the original extent map doesn't meet the minimal generation threshold. Thus this will cause extra IO. So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to indicate if the extent map is merged from one or more ems. And for autodefrag, if we find a merged extent map, and its generation meets the generation requirement, we just don't use this one, and go back to defrag_get_extent() to read extent maps from subvolume trees. This could cause more read IO, but should result less defrag data write, so in the long run it should be a win for autodefrag. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d5633b0d |
|
10-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: bring back the old file extent search behavior For defrag, we don't really want to use btrfs_get_extent() to iterate all extent maps of an inode. The reasons are: - btrfs_get_extent() can merge extent maps And the result em has the higher generation of the two, causing defrag to mark unnecessary part of such merged large extent map. This in fact can result extra IO for autodefrag in v5.16+ kernels. However this patch is not going to completely solve the problem, as one can still using read() to trigger extent map reading, and got them merged. The completely solution for the extent map merging generation problem will come as an standalone fix. - btrfs_get_extent() caches the extent map result Normally it's fine, but for defrag the target range may not get another read/write for a long long time. Such cache would only increase the memory usage. - btrfs_get_extent() doesn't skip older extent map Unlike the old find_new_extent() which uses btrfs_search_forward() to skip the older subtree, thus it will pick up unnecessary extent maps. This patch will fix the regression by introducing defrag_get_extent() to replace the btrfs_get_extent() call. This helper will: - Not cache the file extent we found It will search the file extent and manually convert it to em. - Use btrfs_search_forward() to skip entire ranges which is modified in the past This should reduce the IO for autodefrag. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
550f133f |
|
28-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: remove an ambiguous condition for rejection From the very beginning of btrfs defrag, there is a check to reject extents which meet both conditions: - Physically adjacent We may want to defrag physically adjacent extents to reduce the number of extents or the size of subvolume tree. - Larger than 128K This may be there for compressed extents, but unfortunately 128K is exactly the max capacity for compressed extents. And the check is > 128K, thus it never rejects compressed extents. Furthermore, the compressed extent capacity bug is fixed by previous patch, there is no reason for that check anymore. The original check has a very small ranges to reject (the target extent size is > 128K, and default extent threshold is 256K), and for compressed extent it doesn't work at all. So it's better just to remove the rejection, and allow us to defrag physically adjacent extents. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
979b25c3 |
|
28-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: don't defrag extents which are already at max capacity [BUG] For compressed extents, defrag ioctl will always try to defrag any compressed extents, wasting not only IO but also CPU time to compress/decompress: mkfs.btrfs -f $DEV mount -o compress $DEV $MNT xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar sync xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar sync echo "=== before ===" xfs_io -c "fiemap -v" $MNT/foobar btrfs filesystem defrag $MNT/foobar sync echo "=== after ===" xfs_io -c "fiemap -v" $MNT/foobar Then it shows the 2 128K extents just get COW for no extra benefit, with extra IO/CPU spent: === before === /mnt/btrfs/file1: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..255]: 26624..26879 256 0x8 1: [256..511]: 26632..26887 256 0x9 === after === /mnt/btrfs/file1: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..255]: 26640..26895 256 0x8 1: [256..511]: 26648..26903 256 0x9 This affects not only v5.16 (after the defrag rework), but also v5.15 (before the defrag rework). [CAUSE] From the very beginning, btrfs defrag never checks if one extent is already at its max capacity (128K for compressed extents, 128M otherwise). And the default extent size threshold is 256K, which is already beyond the compressed extent max size. This means, by default btrfs defrag ioctl will mark all compressed extent which is not adjacent to a hole/preallocated range for defrag. [FIX] Introduce a helper to grab the maximum extent size, and then in defrag_collect_targets() and defrag_check_next_extent(), reject extents which are already at their max capacity. Reported-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7093f152 |
|
28-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: don't try to merge regular extents with preallocated extents [BUG] With older kernels (before v5.16), btrfs will defrag preallocated extents. While with newer kernels (v5.16 and newer) btrfs will not defrag preallocated extents, but it will defrag the extent just before the preallocated extent, even it's just a single sector. This can be exposed by the following small script: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file xfs_io -c "fiemap -v" $mnt/file btrfs fi defrag $mnt/file sync xfs_io -c "fiemap -v" $mnt/file The output looks like this on older kernels: /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26624..26631 8 0x0 1: [8..39]: 26632..26663 32 0x801 /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..39]: 26664..26703 40 0x1 Which defrags the single sector along with the preallocated extent, and replace them with an regular extent into a new location (caused by data COW). This wastes most of the data IO just for the preallocated range. On the other hand, v5.16 is slightly better: /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26624..26631 8 0x0 1: [8..39]: 26632..26663 32 0x801 /mnt/btrfs/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 26664..26671 8 0x0 1: [8..39]: 26632..26663 32 0x801 The preallocated range is not defragged, but the sector before it still gets defragged, which has no need for it. [CAUSE] One of the function reused by the old and new behavior is defrag_check_next_extent(), it will determine if we should defrag current extent by checking the next one. It only checks if the next extent is a hole or inlined, but it doesn't check if it's preallocated. On the other hand, out of the function, both old and new kernel will reject preallocated extents. Such inconsistent behavior causes above behavior. [FIX] - Also check if next extent is preallocated If so, don't defrag current extent. - Add comments for each branch why we reject the extent This will reduce the IO caused by defrag ioctl and autodefrag. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
966d879b |
|
10-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target In the rework of btrfs_defrag_file(), we always call defrag_one_cluster() and increase the offset by cluster size, which is only 256K. But there are cases where we have a large extent (e.g. 128M) which doesn't need to be defragged at all. Before the refactor, we can directly skip the range, but now we have to scan that extent map again and again until the cluster moves after the non-target extent. Fix the problem by allow defrag_one_cluster() to increase btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if the last extent of the cluster is not a target. The test script looks like this: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt # As btrfs ioctl uses 32M as extent_threshold xfs_io -f -c "pwrite 0 64M" $mnt/file1 sync # Some fragemented range to defrag xfs_io -s -c "pwrite 65548k 4k" \ -c "pwrite 65544k 4k" \ -c "pwrite 65540k 4k" \ -c "pwrite 65536k 4k" \ $mnt/file1 sync echo "=== before ===" xfs_io -c "fiemap -v" $mnt/file1 echo "=== after ===" btrfs fi defrag $mnt/file1 sync xfs_io -c "fiemap -v" $mnt/file1 umount $mnt With extra ftrace put into defrag_one_cluster(), before the patch it would result tons of loops: (As defrag_one_cluster() is inlined, the function name is its caller) btrfs-126062 [005] ..... 4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144 btrfs-126062 [005] ..... 4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144 btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144 ... btrfs-126062 [005] ..... 4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144 But with this patch there will be just one loop, then directly to the end of the extent: btrfs-130471 [014] ..... 5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144 btrfs-130471 [014] ..... 5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384 CC: stable@vger.kernel.org # 5.16 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d1ffa22 |
|
07-Feb-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: don't try to defrag extents which are under writeback Once we start writeback (have called btrfs_run_delalloc_range()), we allocate an extent, create an extent map point to that extent, with a generation of (u64)-1, created the ordered extent and then clear the DELALLOC bit from the range in the inode's io tree. Such extent map can pass the first call of defrag_collect_targets(), as its generation is (u64)-1, meets any possible minimal generation check. And the range will not have DELALLOC bit, also passing the DELALLOC bit check. It will only be re-checked in the second call of defrag_collect_targets(), which will wait for writeback. But at that stage we have already spent our time waiting for some IO we may or may not want to defrag. Let's reject such extents early so we won't waste our time. CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ea0eba69 |
|
30-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: don't hold CPU for too long when defragging a file There is a user report about "btrfs filesystem defrag" causing 120s timeout problem. For btrfs_defrag_file() it will iterate all file extents if called from defrag ioctl, thus it can take a long time. There is no reason not to release the CPU during such a long operation. Add cond_resched() after defragged one cluster. CC: stable@vger.kernel.org # 5.16 Link: https://lore.kernel.org/linux-btrfs/10e51417-2203-f0a4-2021-86c8511cc367@gmx.com Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
37b45995 |
|
21-Jan-2022 |
Tom Rix <trix@redhat.com> |
btrfs: fix use of uninitialized variable at rm device ioctl Clang static analysis reports this problem ioctl.c:3333:8: warning: 3rd function call argument is an uninitialized value ret = exclop_start_or_cancel_reloc(fs_info, cancel is only set in one branch of an if-check and is always used. So initialize to false. Fixes: 1a15eb724aae ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28b21c55 |
|
21-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix use-after-free after failure to create a snapshot At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and then attach it to the transaction's list of pending snapshots. After that we call btrfs_commit_transaction(), and if that returns an error we jump to 'fail' label, where we kfree() the pending snapshot structure. This can result in a later use-after-free of the pending snapshot: 1) We allocated the pending snapshot and added it to the transaction's list of pending snapshots; 2) We call btrfs_commit_transaction(), and it fails either at the first call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups(). In both cases, we don't abort the transaction and we release our transaction handle. We jump to the 'fail' label and free the pending snapshot structure. We return with the pending snapshot still in the transaction's list; 3) Another task commits the transaction. This time there's no error at all, and then during the transaction commit it accesses a pointer to the pending snapshot structure that the snapshot creation task has already freed, resulting in a user-after-free. This issue could actually be detected by smatch, which produced the following warning: fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list So fix this by not having the snapshot creation ioctl directly add the pending snapshot to the transaction's list. Instead add the pending snapshot to the transaction handle, and then at btrfs_commit_transaction() we add the snapshot to the list only when we can guarantee that any error returned after that point will result in a transaction abort, in which case the ioctl code can safely free the pending snapshot and no one can access it anymore. CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a37d9a17 |
|
20-Jan-2022 |
Amir Goldstein <amir73il@gmail.com> |
fsnotify: invalidate dcache before IN_DELETE event Apparently, there are some applications that use IN_DELETE event as an invalidation mechanism and expect that if they try to open a file with the name reported with the delete event, that it should not contain the content of the deleted file. Commit 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify will have access to a positive dentry. This allowed a race where opening the deleted file via cached dentry is now possible after receiving the IN_DELETE event. To fix the regression, create a new hook fsnotify_delete() that takes the unlinked inode as an argument and use a helper d_delete_notify() to pin the inode, so we can pass it to fsnotify_delete() after d_delete(). Backporting hint: this regression is from v5.3. Although patch will apply with only trivial conflicts to v5.4 and v5.10, it won't build, because fsnotify_delete() implementation is different in each of those versions (see fsnotify_link()). A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo filesystem that do not need to call d_delete(). Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com Reported-by: Ivan Delalande <colona@arista.com> Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/ Fixes: 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()") Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
27cdfde1 |
|
20-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: update writeback index when starting defrag When starting a defrag, we should update the writeback index of the inode's mapping in case it currently has a value beyond the start of the range we are defragging. This can help performance and often result in getting less extents after writeback - for e.g., if the current value of the writeback index sits somewhere in the middle of a range that gets dirty by the defrag, then after writeback we can get two smaller extents instead of a single, larger extent. We used to have this before the refactoring in 5.16, but it was removed without any reason to do so. Originally it was added in kernel 3.1, by commit 2a0f7f5769992b ("Btrfs: fix recursive auto-defrag"), in order to fix a loop with autodefrag resulting in dirtying and writing pages over and over, but some testing on current code did not show that happening, at least with the test described in that commit. So add back the behaviour, as at the very least it is a nice to have optimization. Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") CC: stable@vger.kernel.org # 5.16 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c9d31c7 |
|
20-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add back missing dirty page rate limiting to defrag A defrag operation can dirty a lot of pages, specially if operating on the entire file or a large file range. Any task dirtying pages should periodically call balance_dirty_pages_ratelimited(), as stated in that function's comments, otherwise they can leave too many dirty pages in the system. This is what we did before the refactoring in 5.16, and it should have remained, just like in the buffered write path and relocation. So restore that behaviour. Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0cb5950f |
|
20-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when reserving space during defrag When defragging we can end up collecting a range for defrag that has already pages under delalloc (dirty), as long as the respective extent map for their range is not mapped to a hole, a prealloc extent or the extent map is from an old generation. Most of the time that is harmless from a functional perspective at least, however it can result in a deadlock: 1) At defrag_collect_targets() we find an extent map that meets all requirements but there's delalloc for the range it covers, and we add its range to list of ranges to defrag; 2) The defrag_collect_targets() function is called at defrag_one_range(), after it locked a range that overlaps the range of the extent map; 3) At defrag_one_range(), while the range is still locked, we call defrag_one_locked_target() for the range associated to the extent map we collected at step 1); 4) Then finally at defrag_one_locked_target() we do a call to btrfs_delalloc_reserve_space(), which will reserve data and metadata space. If the space reservations can not be satisfied right away, the flusher might be kicked in and start flushing delalloc and wait for the respective ordered extents to complete. If this happens we will deadlock, because both flushing delalloc and finishing an ordered extent, requires locking the range in the inode's io tree, which was already locked at defrag_collect_targets(). So fix this by skipping extent maps for which there's already delalloc. Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c080b414 |
|
18-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: properly update range->start for autodefrag [BUG] After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") autodefrag no longer properly re-defrag the file from previously finished location. [CAUSE] The recent refactoring of defrag only focuses on defrag ioctl subpage support, doesn't take autodefrag into consideration. There are two problems involved which prevents autodefrag to restart its scan: - No range.start update Previously when one defrag target is found, range->start will be updated to indicate where next search should start from. But now btrfs_defrag_file() doesn't update it anymore, making all autodefrag to rescan from file offset 0. This would also make autodefrag to mark the same range dirty again and again, causing extra IO. - No proper quick exit for defrag_one_cluster() Currently if we reached or exceed @max_sectors limit, we just exit defrag_one_cluster(), and let next defrag_one_cluster() call to do a quick exit. This makes @cur increase, thus no way to properly know which range is defragged and which range is skipped. [FIX] The fix involves two modifications: - Update range->start to next cluster start This is a little different from the old behavior. Previously range->start is updated to the next defrag target. But in the end, the behavior should still be pretty much the same, as now we skip to next defrag target inside btrfs_defrag_file(). Thus if auto-defrag determines to re-scan, then we still do the skip, just at a different timing. - Make defrag_one_cluster() to return >0 to indicate a quick exit So that btrfs_defrag_file() can also do a quick exit, without increasing @cur to the range end, and re-use @cur to update @range->start. - Add comment for btrfs_defrag_file() to mention the range->start update Currently only autodefrag utilize this behavior, as defrag ioctl won't set @max_to_defrag parameter, thus unless interrupted it will always try to defrag the whole range. Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
484167da |
|
18-Jan-2022 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: fix wrong number of defragged sectors [BUG] There are users using autodefrag mount option reporting obvious increase in IO: > If I compare the write average (in total, I don't have it per process) > when taking idle periods on the same machine: > Linux 5.16: > without autodefrag: ~ 10KiB/s > with autodefrag: between 1 and 2MiB/s. > > Linux 5.15: > with autodefrag:~ 10KiB/s (around the same as without > autodefrag on 5.16) [CAUSE] When autodefrag mount option is enabled, btrfs_defrag_file() will be called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many sectors we can defrag in one try. And then use the number of sectors defragged to determine if we need to re-defrag. But commit b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster") uses wrong unit to increase @sectors_defragged, which should be in unit of sector, not byte. This means, if we have defragged any sector, then @sectors_defragged will be >= sectorsize (normally 4096), which is larger than BTRFS_DEFRAG_BATCH. This makes the @max_sectors check in defrag_one_cluster() to underflow, rendering the whole @max_sectors check useless. Thus causing way more IO for autodefrag mount options, as now there is no limit on how many sectors can really be defragged. [FIX] Fix the problems by: - Use sector as unit when increasing @sectors_defragged - Include @sectors_defragged > @max_sectors case to break the loop - Add extra comment on the return value of btrfs_defrag_file() Reported-by: Anthony Ruhier <aruhier@mailbox.org> Fixes: b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b767c2fc |
|
18-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: allow defrag to be interruptible During defrag, at btrfs_defrag_file(), we have this loop that iterates over a file range in steps no larger than 256K subranges. If the range is too long, there's no way to interrupt it. So make the loop check in each iteration if there's signal pending, and if there is, break and return -AGAIN to userspace. Before kernel 5.16, we used to allow defrag to be cancelled through a signal, but that was lost with commit 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()"). This change adds back the possibility to cancel a defrag with a signal and keeps the same semantics, returning -EAGAIN to user space (and not the usually more expected -EINTR). This is also motivated by a recent bug on 5.16 where defragging a 1 byte file resulted in iterating from file range 0 to (u64)-1, as hitting the bug triggered a too long loop, basically requiring one to reboot the machine, as it was not possible to cancel defrag. Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b34cd8e |
|
17-Jan-2022 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix too long loop when defragging a 1 byte file When attempting to defrag a file with a single byte, we can end up in a too long loop, which is nearly infinite because at btrfs_defrag_file() we end up with the variable last_byte assigned with a value of 18446744073709551615 (which is (u64)-1). The problem comes from the fact we end up doing: last_byte = round_up(last_byte, fs_info->sectorsize) - 1; So if last_byte was assigned 0, which is i_size - 1, we underflow and end up with the value 18446744073709551615. This is trivial to reproduce and the following script triggers it: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT echo -n "X" > $MNT/foobar btrfs filesystem defragment $MNT/foobar umount $MNT So fix this by not decrementing last_byte by 1 before doing the sector size round up. Also, to make it easier to follow, make the round up right after computing last_byte. Reported-by: Anthony Ruhier <aruhier@mailbox.org> Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/ CC: stable@vger.kernel.org # 5.16 Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1b58ae0e |
|
13-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: skip transaction commit after failure to create subvolume At ioctl.c:create_subvol(), when we fail to create a subvolume we always commit the transaction. In most cases this is a no-op, since all the error paths, except for one, abort the transaction - the only exception is when we fail to insert the new root item into the root tree, in that case we don't abort the transaction because we didn't do anything that is irreversible - however we end up committing the transaction which although is not a functional problem, it adds unnecessary rotation of the backup roots in the superblock and unnecessary work. So change that to commit a transaction only when no error happened, otherwise just call btrfs_end_transaction() to release our reference on the transaction. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a174c0a2 |
|
25-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: allow device add if balance is paused Currently paused balance precludes adding a device since they are both considered exclusive ops and we can have at most one running at a time. This is problematic in case a filesystem encounters an ENOSPC situation while balance is running, in this case the only thing the user can do is mount the fs with "skip_balance" which pauses balance and delete some data to free up space for balance. However, it should be possible to add a new device when balance is paused. Fix this by allowing device add to proceed when balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
621a1ee1 |
|
25-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make device add compatible with paused balance in btrfs_exclop_start_try_lock This is needed to enable device add to work in cases when a file system has been mounted with 'skip_balance' mount option. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efc0e69c |
|
25-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: introduce exclusive operation BALANCE_PAUSED state Current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that have paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance it will now be into BALANCE_PAUSED instead of simply BALANCE state. 2. When a running balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fdfbf020 |
|
05-Nov-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rework async transaction committing Currently we do this awful thing where we get another ref on a trans handle, async off that handle and commit the transaction from that work. Because we do this we have to mess with current->journal_info and the freeze counting stuff. We already have an async thing to kick for the transaction commit, the transaction kthread. Replace this work struct with a flag on the fs_info to tell the kthread to go ahead and commit even if it's before our timeout. Then we can drastically simplify the async transaction commit path. Note: this can be simplified and functionality based on the pending operation COMMIT. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add note ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3212fa14 |
|
21-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the _nr from the item helpers Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
212a58fd |
|
13-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix warning when freeing leaf after subvolume creation failure When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the root item for the new subvolume into the root tree, we can trigger the following warning: [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.743344] Modules linked in: [78961.749440] dm_snapshot dm_thin_pool (...) [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1 [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.778398] Code: 17 00 48 85 (...) [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202 [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000 [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000 [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8 [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000 [78961.786620] FS: 00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000 [78961.787717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0 [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [78961.791684] Call Trace: [78961.792082] <TASK> [78961.792359] create_subvol+0x5d1/0x9a0 [btrfs] [78961.793054] btrfs_mksubvol+0x447/0x4c0 [btrfs] [78961.794009] ? preempt_count_add+0x49/0xa0 [78961.794705] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [78961.795712] ? _copy_from_user+0x66/0xa0 [78961.796382] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [78961.797392] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [78961.798172] ? __slab_free+0x10a/0x360 [78961.798820] ? rcu_read_lock_sched_held+0x12/0x60 [78961.799664] ? lock_release+0x223/0x4a0 [78961.800321] ? lock_acquired+0x19f/0x420 [78961.800992] ? rcu_read_lock_sched_held+0x12/0x60 [78961.801796] ? trace_hardirqs_on+0x1b/0xe0 [78961.802495] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [78961.803358] ? kmem_cache_free+0x321/0x3c0 [78961.804071] ? __x64_sys_ioctl+0x83/0xb0 [78961.804711] __x64_sys_ioctl+0x83/0xb0 [78961.805348] do_syscall_64+0x3b/0xc0 [78961.805969] entry_SYSCALL_64_after_hwframe+0x44/0xae [78961.806830] RIP: 0033:0x7f31284bc957 [78961.807517] Code: 3c 1c 48 f7 d8 (...) This is because we are calling btrfs_free_tree_block() on an extent buffer that is dirty. Fix that by cleaning the extent buffer, with btrfs_clean_tree_block(), before freeing it. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a163608 |
|
13-Dec-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix invalid delayed ref after subvolume creation failure When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d815b3f2 |
|
16-Nov-2021 |
Dan Carpenter <dan.carpenter@oracle.com> |
btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2() If memdup_user() fails the error handing will crash when it tries to kfree() an error pointer. Just return directly because there is no cleanup required. Fixes: 1a15eb724aae ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6c405b24 |
|
10-Nov-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: deprecate BTRFS_IOC_BALANCE ioctl The v2 balance ioctl has been introduced more than 9 years ago. Users of the old v1 ioctl should have long been migrated to it. It's time we deprecate it and eventually remove it. The only known user is in btrfs-progs that tries v1 as a fallback in case v2 is not supported. This is not necessary anymore. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb523b40 |
|
02-Aug-2021 |
Andreas Gruenbacher <agruenba@redhat.com> |
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Turn fault_in_pages_{readable,writeable} into versions that return the number of bytes not faulted in, similar to copy_to_user, instead of returning a non-zero value when any of the requested pages couldn't be faulted in. This supports the existing users that require all pages to be faulted in as well as new users that are happy if any pages can be faulted in. Rename the functions to fault_in_{readable,writeable} to make sure this change doesn't silently break things. Neither of these functions is entirely trivial and it doesn't seem useful to inline them, so move them to mm/gup.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
#
e77fbf99 |
|
22-Oct-2021 |
David Sterba <dsterba@suse.com> |
btrfs: send: prepare for v2 protocol This is preparatory work for send protocol update to version 2 and higher. We have many pending protocol update requests but still don't have the basic protocol rev in place, the first thing that must happen is to do the actual versioning support. The protocol version is u32 and is a new member in the send ioctl struct. Validity of the version field is backed by a new flag bit. Old kernels would fail when a higher version is requested. Version protocol 0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION, that's also exported in sysfs. The version is still unchanged and will be increased once we have new incompatible commands or stream updates. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
24bcb454 |
|
19-Oct-2021 |
Omar Sandoval <osandov@fb.com> |
btrfs: fix deadlock when defragging transparent huge pages Attempting to defragment a Btrfs file containing a transparent huge page immediately deadlocks with the following stack trace: #0 context_switch (kernel/sched/core.c:4940:2) #1 __schedule (kernel/sched/core.c:6287:8) #2 schedule (kernel/sched/core.c:6366:3) #3 io_schedule (kernel/sched/core.c:8389:2) #4 wait_on_page_bit_common (mm/filemap.c:1356:4) #5 __lock_page (mm/filemap.c:1648:2) #6 lock_page (./include/linux/pagemap.h:625:3) #7 pagecache_get_page (mm/filemap.c:1910:4) #8 find_or_create_page (./include/linux/pagemap.h:420:9) #9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9) #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14) #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9) #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9) #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9) #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10) #15 vfs_ioctl (fs/ioctl.c:51:10) #16 __do_sys_ioctl (fs/ioctl.c:874:11) #17 __se_sys_ioctl (fs/ioctl.c:860:1) #18 __x64_sys_ioctl (fs/ioctl.c:860:1) #19 do_syscall_x64 (arch/x86/entry/common.c:50:14) #20 do_syscall_64 (arch/x86/entry/common.c:80:7) #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113) A huge page is represented by a compound page, which consists of a struct page for each PAGE_SIZE page within the huge page. The first struct page is the "head page", and the remaining are "tail pages". Defragmentation attempts to lock each page in the range. However, lock_page() on a tail page actually locks the corresponding head page. So, if defragmentation tries to lock more than one struct page in a compound page, it tries to lock the same head page twice and deadlocks with itself. Ideally, we should be able to defragment transparent huge pages. However, THP for filesystems is currently read-only, so a lot of code is not ready to use huge pages for I/O. For now, let's just return ETXTBUSY. This can be reproduced with the following on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y: $ cat create_thp_file.c #include <fcntl.h> #include <stdbool.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> static const char zeroes[1024 * 1024]; static const size_t FILE_SIZE = 2 * 1024 * 1024; int main(int argc, char **argv) { if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } int fd = creat(argv[1], 0777); if (fd == -1) { perror("creat"); return EXIT_FAILURE; } size_t written = 0; while (written < FILE_SIZE) { ssize_t ret = write(fd, zeroes, sizeof(zeroes) < FILE_SIZE - written ? sizeof(zeroes) : FILE_SIZE - written); if (ret < 0) { perror("write"); return EXIT_FAILURE; } written += ret; } close(fd); fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } /* * Reserve some address space so that we can align the file mapping to * the huge page size. */ void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (placeholder_map == MAP_FAILED) { perror("mmap (placeholder)"); return EXIT_FAILURE; } void *aligned_address = (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1)); void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC, MAP_SHARED | MAP_FIXED, fd, 0); if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; } if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) { perror("madvise"); return EXIT_FAILURE; } char *line = NULL; size_t line_capacity = 0; FILE *smaps_file = fopen("/proc/self/smaps", "r"); if (!smaps_file) { perror("fopen"); return EXIT_FAILURE; } for (;;) { for (size_t off = 0; off < FILE_SIZE; off += 4096) ((volatile char *)map)[off]; ssize_t ret; bool this_mapping = false; while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) { unsigned long start, end, huge; if (sscanf(line, "%lx-%lx", &start, &end) == 2) { this_mapping = (start <= (uintptr_t)map && (uintptr_t)map < end); } else if (this_mapping && sscanf(line, "FilePmdMapped: %ld", &huge) == 1 && huge > 0) { return EXIT_SUCCESS; } } sleep(6); rewind(smaps_file); fflush(smaps_file); } } $ ./create_thp_file huge $ btrfs fi defrag -czstd ./huge Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a15eb72 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls For device removal and replace we call btrfs_find_device_by_devspec, which if we give it a device path and nothing else will call btrfs_get_dev_args_from_path, which opens the block device and reads the super block and then looks up our device based on that. However at this point we're holding the sb write "lock", so reading the block device pulls in the dependency of ->open_mutex, which produces the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #405 Not tainted ------------------------------------------------------ losetup/11576 is trying to acquire lock: ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_get_by_dev.part.0+0x56/0x3c0 blkdev_get_by_path+0x98/0xa0 btrfs_get_bdev_and_sb+0x1b/0xb0 btrfs_find_device_by_devspec+0x12b/0x1c0 btrfs_rm_device+0x127/0x610 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11576: #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f31b02404cb Instead what we want to do is populate our device lookup args before we grab any locks, and then pass these args into btrfs_rm_device(). From there we can find the device and do the appropriate removal. Suggested-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
562d7b15 |
|
05-Oct-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle device lookup with btrfs_dev_lookup_args We have a lot of device lookup functions that all do something slightly different. Clean this up by adding a struct to hold the different lookup criteria, and then pass this around to btrfs_find_device() so it can do the proper matching based on the lookup criteria. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c22a3572 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: enable defrag for subpage case With the new infrastructure which has taken subpage into consideration, now we should be safe to allow defrag to work for subpage case. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c6357573 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: remove the old infrastructure Now the old infrastructure can all be removed, defrag Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b508037 |
|
27-May-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file() The function defrag_one_cluster() is able to defrag one range well enough, we only need to do preparation for it, including: - Clamp and align the defrag range - Exclude invalid cases - Proper inode locking The old infrastructures will not be removed in this patch, as it would be too noisy to review. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b18c3ab2 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: introduce helper to defrag one cluster This new helper, defrag_one_cluster(), will defrag one cluster (at most 256K): - Collect all initial targets - Kick in readahead when possible - Call defrag_one_range() on each initial target With some extra range clamping. - Update @sectors_defragged parameter This involves one behavior change, the defragged sectors accounting is no longer as accurate as old behavior, as the initial targets are not consistent. We can have new holes punched inside the initial target, and we will skip such holes later. But the defragged sectors accounting doesn't need to be that accurate anyway, thus I don't want to pass those extra accounting burden into defrag_one_range(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e9eec721 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: introduce helper to defrag a range A new helper, defrag_one_range(), is introduced to defrag one range. This function will mostly prepare the needed pages and extent status for defrag_one_locked_target(). As we can only have a consistent view of extent map with page and extent bits locked, we need to re-check the range passed in to get a real target list for defrag_one_locked_target(). Since defrag_collect_targets() will call defrag_lookup_extent() and lock extent range, we also need to teach those two functions to skip extent lock. Thus new parameter, @locked, is introduced to skip extent lock if the caller has already locked the range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
22b398ee |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: introduce helper to defrag a contiguous prepared range A new helper, defrag_one_locked_target(), introduced to do the real part of defrag. The caller needs to ensure both page and extents bits are locked, and no ordered extent exists for the range, and all writeback is finished. The core defrag part is pretty straight-forward: - Reserve space - Set extent bits to defrag - Update involved pages to be dirty Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eb793cf8 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: introduce helper to collect target file extents Introduce a helper, defrag_collect_targets(), to collect all possible targets to be defragged. This function will not consider things like max_sectors_to_defrag, thus caller should be responsible to ensure we don't exceed the limit. This function will be the first stage of later defrag rework. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5767b50c |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: factor out page preparation into a helper In cluster_pages_for_defrag(), we have complex code block inside one for() loop. The code block is to prepare one page for defrag, this will ensure: - The page is locked and set up properly. - No ordered extent exists in the page range. - The page is uptodate. This behavior is pretty common and will be reused by later defrag rework. So factor out the code into its own helper, defrag_prepare_one_page(), for later usage, and cleanup the code by a little. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
76068cae |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: replace hard coded PAGE_SIZE with sectorsize When testing subpage defrag support, I always find some strange inode nbytes error, after a lot of debugging, it turns out that defrag_lookup_extent() is using PAGE_SIZE as size for lookup_extent_mapping(). Since lookup_extent_mapping() is calling __lookup_extent_mapping() with @strict == 1, this means any extent map smaller than one page will be ignored, prevent subpage defrag to grab a correct extent map. There are quite some PAGE_SIZE usage in ioctl.c, but most of them are correct usages, and can be one of the following cases: - ioctl structure size check We want ioctl structure to be contained inside one page. - real page operations The remaining cases in defrag_lookup_extent() and check_defrag_in_cache() will be addressed in this patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cae79686 |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: also check PagePrivate for subpage cases in cluster_pages_for_defrag() In function cluster_pages_for_defrag() we have a window where we unlock page, either start the ordered range or read the content from disk. When we re-lock the page, we need to make sure it still has the correct page->private for subpage. Thus add the extra PagePrivate check here to handle subpage cases properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1ccc2e8a |
|
06-Aug-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file() Currently btrfs_defrag_file() accepts both "struct inode" and "struct file" as parameter. We can easily grab "struct inode" from "struct file" using file_inode() helper. The reason why we need "struct file" is just to re-use its f_ra. Change this to pass "struct file_ra_state" parameter, so that it's more clear what we really want. Since we're here, also add some comments on the function btrfs_defrag_file(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
991a3dae |
|
28-Jul-2021 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop unnecessary ret in ioctl_quota_rescan_status There is no need for the variable ret after d66105cfa873 ("btrfs: allocate btrfs_ioctl_quota_rescan_args on stack"), remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cda00eba |
|
17-Oct-2021 |
Christoph Hellwig <hch@lst.de> |
btrfs: use bdev_nr_bytes instead of open coding it Use the proper helper to read the block device size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20211018101130.1838532-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
3fa421de |
|
27-Jul-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: delay blkdev_put until after the device remove When removing the device we call blkdev_put() on the device once we've removed it, and because we have an EXCL open we need to take the ->open_mutex on the block device to clean it up. Unfortunately during device remove we are holding the sb writers lock, which results in the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #407 Not tainted ------------------------------------------------------ losetup/11595 is trying to acquire lock: ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_put+0x3a/0x220 btrfs_rm_device.cold+0x62/0xe5 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11595: #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc21255d4cb So instead save the bdev and do the put once we've dropped the sb writers lock in order to avoid the lockdep recursion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6623d9a0 |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped INO_LOOKUP_USER ioctl The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl and has the following restrictions. The main difference between the two is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER is scoped beneath the file descriptor passed with the ioctl. Specifically, INO_LOOKUP_USER must adhere to the following restrictions: - The caller must be privileged over each inode of each path component for the path they are trying to lookup. - The path for the subvolume the caller is trying to lookup must be reachable from the inode associated with the file descriptor passed with the ioctl. The second condition makes it possible to scope the lookup of the path to the mount identified by the file descriptor passed with the ioctl. This allows us to enable this ioctl on idmapped mounts. Specifically, this is possible because all child subvolumes of a parent subvolume are reachable when the parent subvolume is mounted. So if the user had access to open the parent subvolume or has been given the fd then they can lookup the path if they had access to it provided they were privileged over each path component. Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name of a subvolume even though they would otherwise be restricted from doing so via regular VFS-based lookup. So think about a parent subvolume with multiple child subvolumes. Someone could mount he parent subvolume and restrict access to the child subvolumes by overmounting them with empty directories. At this point the user can't traverse the child subvolumes and they can't open files in the child subvolumes. However, they can still learn the path of child subvolumes as long as they have access to the parent subvolume by using the INO_LOOKUP_USER ioctl. The underlying assumption here is that it's ok that the lookup ioctls can't really take mounts into account other than the original mount the fd belongs to during lookup. Since this assumption is baked into the original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39e1674f |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl Setting flags on subvolumes or snapshots are core features of btrfs. The SUBVOL_SETFLAGS ioctl is especially important as it allows to make subvolumes and snapshots read-only or read-write. Allow setting flags on btrfs subvolumes and snapshots on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4fed17a |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls The SET_RECEIVED_SUBVOL ioctls are used to set information about a received subvolume. Make it possible to set information about a received subvolume on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aabb34e7 |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids So far we prevented the deletion of subvolumes and snapshots using subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag. This restriction is necessary on idmapped mounts as this allows filesystem wide subvolume and snapshot deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. Deletion by subvolume id works by looking for an alias of the parent of the subvolume or snapshot to be deleted. The parent alias can be anywhere in the filesystem. However, as long as the alias of the parent that is found is the same as the one identified by the file descriptor passed through the ioctl we can allow the deletion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4ed533b |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped SNAP_DESTROY ioctls Destroying subvolumes and snapshots are important features of btrfs. Both operations are available to unprivileged users if the filesystem has been mounted with the "user_subvol_rm_allowed" mount option. Allow subvolume and snapshot deletion on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Subvolumes and snapshots can either be deleted by specifying their name or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set. This feature is blocked on idmapped mounts as this allows filesystem wide subvolume deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. This means that even the root or CAP_SYS_ADMIN capable user can't delete a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional. The root user is currently already subject to permission checks in btrfs_may_delete() including whether the inode's i_uid/i_gid of the directory the subvolume is located in have a mapping in the caller's idmapping. For this to fail isn't currently possible since a btrfs filesystem can't be mounted with a non-initial idmapping but it shows that even the root user would fail to delete a subvolume if the relevant inode isn't mapped in their idmapping. The idmapped mount case is the same in principle. This isn't a huge problem a root user wanting to delete arbitrary subvolumes can just always create another (even detached) mount without an idmapping attached. In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the subvolume to delete is directly located under inode referenced by the fd passed for the ioctl() in a follow-up commit. Here is an example where a btrfs subvolume is deleted through a subvolume mount that does not expose the subvolume to be delete but it can still be deleted by using the subvolume id: /* Compile the following program as "delete_by_spec". */ #define _GNU_SOURCE #include <fcntl.h> #include <inttypes.h> #include <linux/btrfs.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> static int rm_subvolume_by_id(int fd, uint64_t subvolid) { struct btrfs_ioctl_vol_args_v2 args = {}; int ret; args.flags = BTRFS_SUBVOL_SPEC_BY_ID; args.subvolid = subvolid; ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args); if (ret < 0) return -1; return 0; } int main(int argc, char *argv[]) { int subvolid = 0; if (argc < 3) exit(1); fprintf(stderr, "Opening %s\n", argv[1]); int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY); if (fd < 0) exit(2); subvolid = atoi(argv[2]); fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid); int ret = rm_subvolume_by_id(fd, subvolid); if (ret < 0) exit(3); exit(0); } #include <stdio.h>" #include <stdlib.h>" #include <linux/btrfs.h" truncate -s 10G btrfs.img mkfs.btrfs btrfs.img export LOOPDEV=$(sudo losetup -f --show btrfs.img) mount ${LOOPDEV} /mnt sudo chown $(id -u):$(id -g) /mnt btrfs subvolume create /mnt/A btrfs subvolume create /mnt/B/C # Get subvolume id via: sudo btrfs subvolume show /mnt/A # Save subvolid SUBVOLID=<nr> sudo umount /mnt sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt ./delete_by_spec /mnt ${SUBVOLID} With idmapped mounts this can potentially be used by users to delete subvolumes/snapshots they would otherwise not have access to as the idmapping would be applied to an inode that is not exposed in the mount of the subvolume. The fact that this is a filesystem wide operation suggests it might be a good idea to expose this under a separate ioctl that clearly indicates this. In essence, the file descriptor passed with the ioctl is merely used to identify the filesystem on which to operate when BTRFS_SUBVOL_SPEC_BY_ID is used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4d4340c9 |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5474bf40 |
|
26-Jul-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
btrfs: check whether fsgid/fsuid are mapped during subvolume creation When a new subvolume is created btrfs currently doesn't check whether the fsgid/fsuid of the caller actually have a mapping in the user namespace attached to the filesystem. The VFS always checks this to make sure that the caller's fsgid/fsuid can be represented on-disk. This is most relevant for filesystems that can be mounted inside user namespaces but it is in general a good hardening measure to prevent unrepresentable gid/uid from being written to disk. Since we want to support idmapped mounts for btrfs ioctls to create subvolumes in follow-up patches this becomes important since we want to make sure the fsgid/fsuid of the caller as mapped according to the idmapped mount can be represented on-disk. Simply add the missing fsuidgid_has_mapping() line from the VFS may_create() version to btrfs_may_create(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c853a578 |
|
27-Jul-2021 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: allocate btrfs_ioctl_defrag_range_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args, allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_defrag_range_args) = 48 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0afb603a |
|
27-Jul-2021 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: allocate btrfs_ioctl_quota_rescan_args on stack Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args, allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_quota_rescan_args) = 64 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0ff40a91 |
|
29-Jul-2021 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: introduce btrfs_search_backwards function It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
14605409 |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: initial fsverity support Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
77eea05e |
|
30-Jun-2021 |
Boris Burkov <boris@bur.io> |
btrfs: add ro compat flags to inodes Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95ea0486 |
|
26-Jul-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: allow read-write for 4K sectorsize on 64K page size systems Since now we support data and metadata read-write for subpage, remove the RO requirement for subpage mount. There are some extra limitations though: - For now, subpage RW mount is still considered experimental Thus that mount warning will still be there. - No compression support There are still quite some PAGE_SIZE hard coded and quite some call sites use extent_clear_unlock_delalloc() to unlock locked_page. This will screw up subpage helpers. Now for subpage RW mount, no matter what mount option or inode attr is set, all writes will not be compressed. Although reading compressed data has no problem. - No defrag for subpage case The defrag support for subpage case will come in later patches, which will also rework the defrag workflow. - No inline extent will be created This is mostly due to the fact that filemap_fdatawrite_range() will trigger more write than the range specified. In fallocate calls, this behavior can make us to writeback which can be inlined, before we enlarge the i_size. This is a very special corner case, and even current btrfs check won't report error on such inline extent + regular extent. But considering how much effort has been put to prevent such inline + regular, I'd prefer to cut off inline extent completely until we have a good solution. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a9fd417 |
|
21-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: fix typos in comments Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32cc4f87 |
|
03-Jun-2021 |
David Sterba <dsterba@suse.com> |
btrfs: sink wait_for_unblock parameter to async commit There's only one caller left btrfs_ioctl_start_sync that passes 0, so we can remove the switch in btrfs_commit_transaction_async. A cleanup 9babda9f33fd ("btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed 1, so this is a followup. As this removes last call of wait_current_trans_commit_start_and_unblock, remove the function as well. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67ae34b6 |
|
14-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: add device delete cancel Accept device name "cancel" as a request to cancel running device deletion operation. The string is literal, in case there's a real device named "cancel", pass it as full absolute path or as "./cancel" This works for v1 and v2 ioctls when the device is specified by name. Moving chunks from the device uses relocation, use the conditional exclusive operation start and cancellation helpers Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bb059a37 |
|
18-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: add cancellation to resize Accept literal string "cancel" as resize operation and interpret that as a request to cancel the running operation. If it's running, wait until it finishes current work and return ECANCELED. Shrinking resize uses relocation to move the chunks away, use the conditional exclusive operation start and cancellation helpers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
17aaa434 |
|
14-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: add wrapper for conditional start of exclusive operation To support optional cancellation of some operations, add helper that will wrap all the combinations. In normal mode it's same as btrfs_exclop_start, in cancellation mode it checks if it's already running and request cancellation and waits until completion. The error codes can be returned to to user space and semantics is not changed, adding ECANCELED. This should be evaluated as an error and that the operation has not completed and the operation should be restarted or the filesystem status reviewed. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
578bda9e |
|
18-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: introduce try-lock semantics for exclusive op start Add try-lock for exclusive operation start to allow callers to do more checks. The same operation must already be running. The try-lock and unlock must pair and are a substitute for btrfs_exclop_start, thus it must also pair with btrfs_exclop_finish to release the exclop context. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0d7ed32c |
|
14-May-2021 |
David Sterba <dsterba@suse.com> |
btrfs: protect exclusive_operation by super_lock The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50535db8 |
|
04-May-2021 |
Tian Tao <tiantao6@hisilicon.com> |
btrfs: return EAGAIN if defrag is canceled When inode defrag is canceled, the error is set to EAGAIN but then overwritten by number of defragmented bytes. As this would hide the error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it will print the error and continue to next file (as it does in for any other error). Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9b8a233b |
|
30-Apr-2021 |
Ritesh Harjani <riteshh@linux.ibm.com> |
btrfs: handle transaction start error in btrfs_fileattr_set Add error handling in btrfs_fileattr_set in case of an error while starting a transaction. This fixes btrfs/232 which otherwise used to fail with below signature on Power. btrfs/232 [ 1119.474650] run fstests btrfs/232 at 2021-04-21 02:21:22 <...> [ 1366.638585] BUG: Unable to handle kernel data access on read at 0xffffffffffffff86 [ 1366.638768] Faulting instruction address: 0xc0000000009a5c88 cpu 0x0: Vector: 380 (Data SLB Access) at [c000000014f177b0] pc: c0000000009a5c88: btrfs_update_root_times+0x58/0xc0 lr: c0000000009a5c84: btrfs_update_root_times+0x54/0xc0 <...> pid = 24881, comm = fsstress btrfs_update_inode+0xa0/0x140 btrfs_fileattr_set+0x5d0/0x6f0 vfs_fileattr_set+0x2a8/0x390 do_vfs_ioctl+0x1290/0x1ac0 sys_ioctl+0x6c/0x120 system_call_exception+0x3d4/0x410 system_call_common+0xec/0x278 Fixes: 97fc29775487 ("btrfs: convert to fileattr") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f9baa501 |
|
21-Apr-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extents and using qgroups There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97fc2977 |
|
07-Apr-2021 |
Miklos Szeredi <mszeredi@redhat.com> |
btrfs: convert to fileattr Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: David Sterba <dsterba@suse.com>
|
#
67addf29 |
|
20-Apr-2021 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix metadata extent leak after failure to create subvolume When creating a subvolume we allocate an extent buffer for its root node after starting a transaction. We setup a root item for the subvolume that points to that extent buffer and then attempt to insert the root item into the root tree - however if that fails, due to ENOMEM for example, we do not free the extent buffer previously allocated and we do not abort the transaction (as at that point we did nothing that can not be undone). This means that we effectively do not return the metadata extent back to the free space cache/tree and we leave a delayed reference for it which causes a metadata extent item to be added to the extent tree, in the next transaction commit, without having backreferences. When this happens 'btrfs check' reports the following: $ btrfs check /dev/sdi Opening filesystem to check... Checking filesystem on /dev/sdi UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb [1/7] checking root items [2/7] checking extents ref mismatch on [30425088 16384] extent item 1, found 0 backref 30425088 root 256 not referenced back 0x564a91c23d70 incorrect global backref count on 30425088 found 1 wanted 0 backpointer mismatch on [30425088 16384] owner ref check failed [30425088 16384] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups skipped (not enabled on this FS) found 212992 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124669 file data blocks allocated: 65536 referenced 65536 So fix this by freeing the metadata extent if btrfs_insert_root() returns an error. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
221581e4 |
|
12-Mar-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: handle btrfs_record_root_in_trans failure in create_subvol btrfs_record_root_in_trans will return errors in the future, so handle the error properly in create_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
64708539 |
|
10-Feb-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5011c5a6 |
|
16-Feb-2021 |
Dan Carpenter <dancarpenter@oracle.com> |
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl The problem is we're copying "inherit" from user space but we don't necessarily know that we're copying enough data for a 64 byte struct. Then the next problem is that 'inherit' has a variable size array at the end, and we have to verify that array is the size we expected. Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ba73d987 |
|
21-Jan-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
namei: handle idmapped mounts in may_*() helpers The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
#
21cb47be |
|
21-Jan-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
inode: make init and permission helpers idmapped mount aware The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
#
47291baa |
|
21-Jan-2021 |
Christian Brauner <christian.brauner@ubuntu.com> |
namei: make permission helpers idmapped mount aware The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
#
1cb3dc3f |
|
04-Feb-2021 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: disallow fitrim on zoned filesystems The implementation of fitrim depends on space cache, which is not used and disabled for zoned extent allocator. So the current code does not work with zoned filesystem. In the future, we can implement fitrim for zoned filesystems by enabling space cache (but, only for fitrim) or scanning the extent tree at fitrim time. For now, disallow fitrim on zoned filesystems. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32443de3 |
|
26-Jan-2021 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce btrfs_subpage for data inodes To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9db4dc24 |
|
10-Jan-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_start_delalloc_root's nr argument a long It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69948022 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove new_dirid argument from btrfs_create_subvol_root It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23125104 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_root::free_objectid hold the next available objectid Adjust the way free_objectid is being initialized, it now stores BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary, BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit that now it becomes unnecessary to explicitly initialize free_objectid for a newly create fs root. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b8fad57 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_root::highest_objectid to free_objectid This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
543068a2 |
|
07-Dec-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3d45f221 |
|
02-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extent and low on free metadata space When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock: * the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation; * the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation; * the async reclaim task blocks waiting for the delalloc work to complete; * the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree; * if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks. When this happens, traces like the following show up in dmesg/syslog: [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...) Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path). Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes. This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem. CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5297199a |
|
26-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove inode number cache feature It's been deprecated since commit b547a88ea577 ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons. A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps, that data stay on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the change can be implemented in btrfs-progs if needed. [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d206e9c9 |
|
10-Nov-2020 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: disallow NODATACOW in ZONED mode NODATACOW implies overwriting the file data on a device, which is impossible in sequential required zones. Disable NODATACOW globally with mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9a56fcd1 |
|
02-Nov-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_update_inode take btrfs_inode Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b2598edf |
|
02-Nov-2020 |
Anand Jain <anand.jain@oracle.com> |
btrfs: remove unused argument seed from btrfs_find_device Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to check if the parameter seed is true in the function btrfs_find_device(). This tells it whether to traverse the seed device list or not. After this commit, the argument is unused and can be removed. In device_list_add() it's not necessary because fs_devices always points to the device's fs_devices. So with the devid+uuid matching, it will find the right device and return, thus not needing to traverse seed devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7f458a38 |
|
04-Nov-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race when defragmenting leads to unnecessary IO When defragmenting we skip ranges that have holes or inline extents, so that we don't do unnecessary IO and waste space. We do this check when calling should_defrag_range() at btrfs_defrag_file(). However we do it without holding the inode's lock. The reason we do it like this is to avoid blocking other tasks for too long, that possibly want to operate on other file ranges, since after the call to should_defrag_range() and before locking the inode, we trigger a synchronous page cache readahead. However before we were able to lock the inode, some other task might have punched a hole in our range, or we may now have an inline extent there, in which case we should not set the range for defrag anymore since that would cause unnecessary IO and make us waste space (i.e. allocating extents to contain zeros for a hole). So after we locked the inode and the range in the iotree, check again if we have holes or an inline extent, and if we do, just skip the range. I hit this while testing my next patch that fixes races when updating an inode's number of bytes (subject "btrfs: update the number of bytes used by an inode atomically"), and it depends on this change in order to work correctly. Alternatively I could rework that other patch to detect holes and flag their range with the 'new delalloc' bit, but this itself fixes an efficiency problem due a race that from a functional point of view is not harmful (it could be triggered with btrfs/062 from fstests). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b9729ce0 |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: locking: rip out path->leave_spinning We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a1fbc675 |
|
04-Oct-2020 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch On 32-bit systems, this shift will overflow for files larger than 4GB as start_index is unsigned long while the calls to btrfs_delalloc_*_space expect u64. CC: stable@vger.kernel.org # 4.4+ Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> [ define the variable instead of repeating the shift ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c0a43603 |
|
17-Sep-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: remove inode argument from btrfs_start_ordered_extent The passed in ordered_extent struct is always well-formed and contains the inode making the explicit argument redundant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72804905 |
|
01-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: kill the RCU protection for fs_info->space_info We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount. The list never changes and we're protected from messing with it by the normal mount/umount path, so kill the RCU stuff around it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66a2823c |
|
25-Aug-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: sysfs: export currently running exclusive operation /sys/fs/<fsid>/exclusive_operation contains the currently executing exclusive operation. Add a sysfs_notify() when operation end, so userspace can be notified of exclusive operation is finished. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3e1f96c |
|
25-Aug-2020 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: enumerate the type of exclusive operation in progress Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9631e4cc |
|
20-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e85fde51 |
|
24-Jul-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations [BUG] When quota is enabled for TEST_DEV, generic/013 sometimes fails like this: generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg) And with the following metadata leak: BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace a6cfd45ba80e4e06 ]--- BTRFS error (device dm-3): qgroup reserved space leaked BTRFS info (device dm-3): disk space caching is enabled BTRFS info (device dm-3): has skinny extents [CAUSE] The qgroup preallocated meta rsv operations of that offending root are: btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 It's pretty obvious that, we reserve qgroup meta rsv in btrfs_subvolume_reserve_metadata(), but doesn't have corresponding release/convert calls in btrfs_subvolume_release_metadata(). This leads to the leakage. [FIX] To fix this bug, we should follow what we're doing in btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and add it to block_rsv->qgroup_rsv_reserved. And free the qgroup reserved metadata space when releasing the block_rsv. To do this, we need to change the btrfs_subvolume_release_metadata() to accept btrfs_root, and record the qgroup_to_release number, and call btrfs_qgroup_convert_reserved_meta() for it. Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b4912139 |
|
21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: change nr to u64 in btrfs_start_delalloc_roots We have btrfs_wait_ordered_roots() which takes a u64 for nr, but btrfs_start_delalloc_roots() that takes an int for nr, which makes using them in conjunction, especially for something like (u64)-1, annoying and inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and adjust start_delalloc_inodes() and it's callers appropriately. This means we've adjusted start_delalloc_inodes() to take a pointer of nr since we want to preserve the ability for start-delalloc_inodes() to return an error, so simply make it do the nr adjusting as necessary. Part of adjusting the callers to this means changing btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be confusing because it seems unrelated, but the caller of btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the function variable that needs to be changed. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1c78544e |
|
14-Sep-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix wrong address when faulting in pages in the search ioctl When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") was added to stable trees, and these applications make heavy use of the search ioctls. This fixes their issues. Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/ Link: https://github.com/kilobyte/compsize/issues/34 Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") CC: stable@vger.kernel.org # 4.4+ Tested-by: A L <mail@lechevalier.se> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a48b73ec |
|
10-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix potential deadlock in the search ioctl With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted ------------------------------------------------------ compsize/11122 is trying to acquire lock: ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-fs-00){++++}-{3:3}: down_write_nested+0x3b/0x70 __btrfs_tree_lock+0x24/0x120 btrfs_search_slot+0x756/0x990 btrfs_lookup_inode+0x3a/0xb4 __btrfs_update_delayed_inode+0x93/0x270 btrfs_async_run_delayed_root+0x168/0x230 btrfs_work_helper+0xd4/0x570 process_one_work+0x2ad/0x5f0 worker_thread+0x3a/0x3d0 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #1 (&delayed_node->mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_delayed_update_inode+0x50/0x440 btrfs_update_inode+0x8a/0xf0 btrfs_dirty_inode+0x5b/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&mm->mmap_lock#2){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-fs-00); lock(&delayed_node->mutex); lock(btrfs-fs-00); lock(&mm->mmap_lock#2); *** DEADLOCK *** 1 lock held by compsize/11122: #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __might_fault+0x3e/0x90 ? find_held_lock+0x72/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 ? btrfs_search_forward+0x2a6/0x360 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ? __do_sys_newfstat+0x5a/0x70 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is we're doing a copy_to_user() while holding tree locks, which can deadlock if we have to do a page fault for the copy_to_user(). This exists even without my locking changes, so it needs to be fixed. Rework the search ioctl to do the pre-fault and then copy_to_user_nofault for the copying. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f37c563b |
|
10-Jul-2020 |
David Sterba <dsterba@suse.com> |
btrfs: add missing check for nocow and compression inode flags User Forza reported on IRC that some invalid combinations of file attributes are accepted by chattr. The NODATACOW and compression file flags/attributes are mutually exclusive, but they could be set by 'chattr +c +C' on an empty file. The nodatacow will be in effect because it's checked first in btrfs_run_delalloc_range. Extend the flag validation to catch the following cases: - input flags are conflicting - old and new flags are conflicting - initialize the local variable with inode flags after inode ls locked Inode attributes take precedence over mount options and are an independent setting. Nocompress would be a no-op with nodatacow, but we don't want to mix any compression-related options with nodatacow. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49bac897 |
|
13-Jul-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: add metadata_uuid to FS_INFO ioctl Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0fb408a5 |
|
13-Jul-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: add filesystem generation to FS_INFO ioctl Add retrieval of the filesystem's generation to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
137c5418 |
|
13-Jul-2020 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl With the recent addition of filesystem checksum types other than CRC32c, it is not anymore hard-coded which checksum type a btrfs filesystem uses. Up to now there is no good way to read the filesystem checksum, apart from reading the filesystem UUID and then query sysfs for the checksum type. Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl command which usually is used to query filesystem features. Also add a flags member indicating that the kernel responded with a set csum_type and csum_size field. For compatibility reasons, only return the csum_type and csum_size if the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also clear any unknown flags so we don't pass false positives to user-space newer than the kernel. To simplify further additions to the ioctl, also switch the padding to a u8 array. Pahole was used to verify the result of this switch: The csum members are added before flags, which might look odd, but this is to keep the alignment requirements and not to introduce holes in the structure. $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko struct btrfs_ioctl_fs_info_args { __u64 max_id; /* 0 8 */ __u64 num_devices; /* 8 8 */ __u8 fsid[16]; /* 16 16 */ __u32 nodesize; /* 32 4 */ __u32 sectorsize; /* 36 4 */ __u32 clone_alignment; /* 40 4 */ __u16 csum_type; /* 44 2 */ __u16 csum_size; /* 46 2 */ __u64 flags; /* 48 8 */ __u8 reserved[968]; /* 56 968 */ /* size: 1024, cachelines: 16, members: 10 */ }; Fixes: 3951e7f050ac ("btrfs: add xxhash64 to checksumming algorithms") Fixes: 3831bf0094ab ("btrfs: add sha256 to checksumming algorithm") CC: stable@vger.kernel.org # 5.5+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2dfb1e43 |
|
15-Jun-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: preallocate anon block device at first phase of snapshot creation [BUG] When the anonymous block device pool is exhausted, subvolume/snapshot creation fails with EMFILE (Too many files open). This has been reported by a user. The allocation happens in the second phase during transaction commit where it's only way out is to abort the transaction BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] When the global anonymous block device pool is exhausted, the following call chain will fail, and lead to transaction abort: btrfs_ioctl_snap_create_v2() |- btrfs_ioctl_snap_create_transid() |- btrfs_mksubvol() |- btrfs_commit_transaction() |- create_pending_snapshot() |- btrfs_get_fs_root() |- btrfs_init_fs_root() |- get_anon_bdev() [FIX] Although we can't enlarge the anonymous block device pool, at least we can preallocate anon_dev for subvolume/snapshot in the first phase, outside of transaction context and exactly at the moment the user calls the creation ioctl. Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e5b7231e |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_delalloc_reserve_space take btrfs_inode All of its children take btrfs_inode so bubble up this requirement to btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I internally. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86d52921 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_delalloc_release_space take btrfs_inode It needs btrfs_inode so take it as a parameter directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c3504372 |
|
02-Jun-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_lookup_ordered_extent take btrfs_inode It doesn't use the generic vfs inode for anything use btrfs_inode directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b091f7fe |
|
16-Jun-2020 |
Waiman Long <longman@redhat.com> |
btrfs: use kfree() in btrfs_ioctl_get_subvol_info() In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kzfree() isn't really needed, use kfree() instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0202e83f |
|
15-May-2020 |
David Sterba <dsterba@suse.com> |
btrfs: simplify iget helpers The inode lookup starting at btrfs_iget takes the full location key, while only the objectid is used to match the inode, because the lookup happens inside the given root thus the inode number is unique. The entire location key is properly set up in btrfs_init_locked_inode. Simplify the helpers and pass only inode number, renaming it to 'ino' instead of 'objectid'. This allows to remove temporary variables key, saving some stack space. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
56e9357a |
|
15-May-2020 |
David Sterba <dsterba@suse.com> |
btrfs: simplify root lookup by id The main function to lookup a root by its id btrfs_get_fs_root takes the whole key, while only using the objectid. The value of offset is preset to (u64)-1 but not actually used until btrfs_find_root that does the actual search. Switch btrfs_get_fs_root to use only objectid and remove all local variables that existed just for the lookup. The actual key for search is set up in btrfs_get_fs_root, reusing another key variable. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c11fbb6e |
|
14-May-2020 |
Robbie Ko <robbieko@synology.com> |
btrfs: reduce lock contention when creating snapshot When creating a snapshot, ordered extents need to be flushed and this can take a long time. In create_snapshot there are two locks held when this happens: 1. Destination directory inode lock 2. Global subvolume semaphore This will unnecessarily block other operations like subvolume destroy, create, or setflag until the snapshot is created. We can fix that by moving the flush outside the locked section as this does not depend on the aforementioned locks. The code factors out the snapshot related work from create_snapshot to btrfs_mksnapshot. __btrfs_ioctl_snap_create btrfs_mksubvol create_subvol btrfs_mksnapshot <flush> btrfs_mksubvol create_snapshot Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92a7cc42 |
|
15-May-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9babda9f |
|
13-Mar-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to pass the async_transid parameter so remove it and any code using it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5d54c67e |
|
13-Mar-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid btrfs_ioctl_snap_create_transid no longer takes a transid argument, so remove it and rename the function to __btrfs_ioctl_snap_create to reflect it's an internal, worker function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9c1036fd |
|
13-Mar-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support This functionality was deprecated in kernel 5.4. Since no one has complained of the impending removal it's time we did so. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6a177381 |
|
28-Feb-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: move all reflink implementation code into its own file The reflink code is quite large and has been living in ioctl.c since ever. It has grown over the years after many bug fixes and improvements, and since I'm planning on making some further improvements on it, it's time to get it better organized by moving into its own file, reflink.c (similar to what xfs does for example). This change only moves the code out of ioctl.c into the new file, it doesn't do any other change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
807fc790 |
|
24-Feb-2020 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
btrfs: switch to use new generic UUID API There are new types and helpers that are supposed to be used in new code. As a preparation to get rid of legacy types and API functions do the conversion here. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
faf8f7b9 |
|
11-Feb-2020 |
Marcos Paulo de Souza <marcos@mpdesouza.com> |
btrfs: ioctl: resize: only show message if size is changed There is no point to inform the user about size change if there's none. Update the message to conform to a commonly used format where the path and devid are printed and also print old and new sizes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Marcos Paulo de Souza <marcos@mpdesouza.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance message ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dcc3eb96 |
|
30-Jan-2020 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: convert snapshot/nocow exlcusion to drew lock This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drew lock to ensure this invariant still holds. 'Readers' are snapshot creators from create_snapshot and 'writers' are nocow writers from buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots and also favors readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu's writers counter to avoid cacheline contention. The main gain from using the drew lock is it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
949964c9 |
|
07-Feb-2020 |
Marcos Paulo de Souza <mpdesouza@suse.com> |
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl This ioctl will be responsible for deleting a subvolume using its id. This can be used when a system has a file system mounted from a subvolume, rather than the root file system, like below: / @subvol1/ @subvol2/ @subvol_default/ If only @subvol_default is mounted, we have no path to reach @subvol1 and @subvol2, thus no way to delete them. Current subvolume delete ioctl takes a file handle point as argument, and if @subvol_default is mounted, we can't reach @subvol1 and @subvol2 from the same mount point. This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes the extended structure with flags to allow to delete subvolume using subvolid. Now, we can use this new ioctl specifying the subvolume id and refer to the same mount point. It doesn't matter which subvolume was mounted, since we can reach to the desired one using the subvolume id, and then delete it. The full path to the subvolume id is resolved internally and access is verified as if the subvolume was accessed by path. The volume args v2 structure is extended to use the existing union for subvolume id specification, that's valid in case the BTRFS_SUBVOL_SPEC_BY_ID is set. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
748449cdb |
|
21-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use ioctl args support mask for device delete When the device remove v2 ioctl was added, the full support mask was added to sanity check the flags. However this would allow to let the subvolume related flags to be accepted. This is not supposed to happen. Use the correct support mask, which means that now any of BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is a user-visible change, specifying subvolume flags for device deletion does not make sense and there are hopefully no applications doing that. Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
673990db |
|
21-Feb-2020 |
David Sterba <dsterba@suse.com> |
btrfs: use ioctl args support mask for subvolume create/delete Using the defined mask instead of flag enumeration in the ioctl handler is preferred. No functional changes. Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
00246528 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root We are now using these for all roots, rename them to btrfs_put_root() and btrfs_grab_root(); Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc44d7c4 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root Now that all callers of btrfs_get_fs_root are subsequently calling btrfs_grab_fs_root and handling dropping the ref when they are done appropriately, go ahead and push btrfs_grab_fs_root up into btrfs_get_fs_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5119cfc3 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in create_pending_snapshot We create the snapshot and then use it for a bunch of things, we need to hold a ref on it while we're messing with it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2a2b5d62 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold ref on root in btrfs_ioctl_default_subvol We look up an arbitrary fs root here, we need to hold a ref on the root for the duration. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
04734e84 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info We look up whatever root userspace has given us, we need to hold a ref throughout this operation. Use 'root' only for the on fs root and not as a temporary variable elsewhere. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b8a49ae1 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in btrfs_search_path_in_tree_user We can wander into a different root, so grab a ref on the root we look up. Later on we make root = fs_info->tree_root so we need this separate out label to make sure we do the right cleanup only in the case we're looking up a different root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
88234012 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in btrfs_search_path_in_tree We look up an arbitrary fs root, we need to hold a ref on it while we're doing our search. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3ca35e83 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in search_ioctl We lookup a arbitrary fs root, we need to hold a ref on that root. If we're using our own inodes root then grab a ref on that as well to make the cleanup easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc92f798 |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in create_subvol We're creating the new root here, but we should hold the ref until after we've initialized the inode for it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3619c94f |
|
24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: open code btrfs_read_fs_root_no_name All this does is call btrfs_get_fs_root() with check_ref == true. Just use btrfs_get_fs_root() so we don't have a bunch of different helpers that do the same thing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d923afe9 |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: replace all uses of btrfs_ordered_update_i_size Now that we have a safe way to update the i_size, replace all uses of btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
790a1d44 |
|
17-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update We were using btrfs_i_size_write(), which unconditionally jacks up inode->disk_i_size. However since clone can operate on ranges we could have pending ordered extents for a range prior to the start of our clone operation and thus increase disk_i_size too far and have a hole with no file extent. Fix this by using the btrfs_ordered_update_i_size helper which will do the right thing in the face of pending ordered extents outside of our clone range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
831d2fa2 |
|
16-Dec-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make deduplication with range including the last block work Since btrfs was migrated to use the generic VFS helpers for clone and deduplication, it stopped allowing for the last block of a file to be deduplicated when the source file size is not sector size aligned (when eof is somewhere in the middle of the last block). There are two reasons for that: 1) The generic code always rounds down, to a multiple of the block size, the range's length for deduplications. This means we end up never deduplicating the last block when the eof is not block size aligned, even for the safe case where the destination range's end offset matches the destination file's size. That rounding down operation is done at generic_remap_check_len(); 2) Because of that, the btrfs specific code does not expect anymore any non-aligned range length's for deduplication and therefore does not work if such nona-aligned length is given. This patch addresses that second part, and it depends on a patch that fixes generic_remap_check_len(), in the VFS, which was submitted ealier and has the following subject: "fs: allow deduplication of eof block into the end of the destination file" These two patches address reports from users that started seeing lower deduplication rates due to the last block never being deduplicated when the file size is not aligned to the filesystem's block size. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39b07b5d |
|
02-Dec-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: drop create parameter to btrfs_get_extent() We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it. While we're here, let's document btrfs_get_extent(). Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5afe6ce7 |
|
16-Jan-2020 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: always copy scrub arguments back to user space If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c7e54b51 |
|
06-Dec-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: abort transaction after failed inode updates in create_subvol We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
147271e3 |
|
05-Dec-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix hole extent items with a zero size after range cloning Normally when cloning a file range if we find an implicit hole at the end of the range we assume it is because the NO_HOLES feature is enabled. However that is not always the case. One well known case [1] is when we have a power failure after mixing buffered and direct IO writes against the same file. In such cases we need to punch a hole in the destination file, and if the NO_HOLES feature is not enabled, we need to insert explicit file extent items to represent the hole. After commit 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents"), we started to insert file extent items representing the hole with an item size of 0, which is invalid and should be 53 bytes (the size of a btrfs_file_extent_item structure), resulting in all sorts of corruptions and invalid memory accesses. This is detected by the tree checker when we attempt to write a leaf to disk. The problem can be sporadically triggered by test case generic/561 from fstests. That test case does not exercise power failure and creates a new filesystem when it starts, so it does not use a filesystem created by any previous test that tests power failure. However the test does both buffered and direct IO writes (through fsstress) and it's precisely that which is creating the implicit holes in files. That happens even before the commit mentioned earlier. I need to investigate why we get those implicit holes to check if there is a real problem or not. For now this change fixes the regression of introducing file extent items with an item size of 0 bytes. Fix the issue by calling btrfs_punch_hole_range() without passing a btrfs_clone_extent_info structure, which ensures file extent items are inserted to represent the hole with a correct item size. We were passing a btrfs_clone_extent_info with a value of 0 for its 'item_size' field, which was causing the insertion of file extent items with an item size of 0. [1] https://www.spinics.net/lists/linux-btrfs/msg75350.html Reported-by: David Sterba <dsterba@suse.com> Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
32da5386 |
|
29-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b3470b5d |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add dedicated members for start and length of a block group The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes thing harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf38be65 |
|
23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move block_group_item::used to block group For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree. The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b929c1d8 |
|
10-Oct-2019 |
Marcos Paulo de Souza <marcos.souza.org@gmail.com> |
btrfs: ioctl: Try to use btrfs_fs_info instead of *file Some functions are doing some unnecessary indirection to reach the btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info struct instead of a *file. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ce96b7ff |
|
10-Oct-2019 |
Chengguang Xu <cgxu519@mykernel.net> |
btrfs: use better definition of number of compression type The compression type upper limit constant is the same as the last value and this is confusing. In order to keep coding style consistent, use BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of 'NR' being one more than the last value. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e1f60a65 |
|
01-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add __pure attribute to functions The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure . The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c66e0d4 |
|
03-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused parameter is_new from btrfs_iget The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a5009d3a |
|
04-Nov-2019 |
David Sterba <dsterba@suse.com> |
btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as deprecated and scheduled for removal but we actualy do use them for 'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5fc should have been just the async flag for subvolume creation. The deprecation has been added in this development cycle, remove it until it's time. Fixes: ebc87351e5fc ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag") Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8702ba93 |
|
14-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */ /* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */ /* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e182163d |
|
15-Aug-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: stop clearing EXTENT_DIRTY in inode I/O tree Since commit fee187d9d9dd ("Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we can simplify and stop trying to clear it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebc87351 |
|
26-Aug-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag Support for asynchronous snapshot creation was originally added in 72fd032e9424 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for ceph's backend needs. However, since Ceph has deprecated support for btrfs there is no longer need for that support in btrfs. Additionally, this was never supported by btrfs-progs, the official userspace tools. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f10152bc |
|
01-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: sysfs: replace direct access to feature set names with a helper In order to unexport the feature type array, add a helper for the enum-to-string conversion. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aac0023c |
|
20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move basic block_group definitions to their own header This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b64119b5 |
|
02-Jul-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove unnecessary condition in btrfs_clone() to avoid too much nesting The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(), is done inside an if statement that checks if the found key has the type BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because btrfs_search_slot() always leaves us in a leaf slot that points to a key that is always greater then or equals to the search key, and our search key here has that type, and because we bail out before that if statement if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY. Therefore just remove that if statement, not only because it is useless but mostly because it increases the nesting/indentation level in this function which is quite big and makes things a bit awkward whenever I need to fix something that requires changing btrfs_clone() (and it has been like that for many years already). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
40cf931f |
|
16-Jul-2019 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: use common vfs LABEL ioctl definitions I lifted the btrfs label get/set ioctls to the vfs some time ago, but never followed up to use those common definitions directly in btrfs. This patch does that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
690a5dbf |
|
05-Jul-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents When cloning extents (or deduplicating) we create a transaction with a space reservation that considers we will drop or update a single file extent item of the destination inode (that we modify a single leaf). That is fine for the vast majority of scenarios, however it might happen that we need to drop many file extent items, and adjust at most two file extent items, in the destination root, which can span multiple leafs. This will lead to either the call to btrfs_drop_extents() to fail with ENOSPC or the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode() (called through clone_finish_inode_update()) to fail with ENOSPC. Such failure results in a transaction abort, leaving the filesystem in a read-only mode. In order to fix this we need to follow the same approach as the hole punching code, where we create a local reservation with 1 unit and keep ending and starting transactions, after balancing the btree inode, when __btrfs_drop_extents() returns ENOSPC. So fix this by making the extent cloning call calls the recently added btrfs_punch_hole_range() helper, which is what does the mentioned work for hole punching, and make sure whenever we drop extent items in a transaction, we also add a replacing file extent item, to avoid corruption (a hole) if after ending a transaction and before starting a new one, the old transaction gets committed and a power failure happens before we finish cloning. A test case for fstests follows soon. Reported-by: David Goodwin <david@codepoets.co.uk> Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/ Reported-by: Sam Tygier <sam@tygier.co.uk> Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/ Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
86736342 |
|
19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the delalloc space stuff to it's own home We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8719aaae |
|
18-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move space_info to space-info.h Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b0e492e |
|
01-Jul-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: create a generic checking function for FS_IOC_FSSETXATTR Create a generic checking function for the incoming FS_IOC_FSSETXATTR fsxattr values so that we can standardize some of the implementation behaviors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
5aca2842 |
|
01-Jul-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: create a generic checking and prep function for FS_IOC_SETFLAGS Create a generic function to check incoming FS_IOC_SETFLAGS flag values and later prepare the inode for updates so that we can standardize the implementations that follow ext4's flag values. Note that the efivarfs implementation no longer fails a no-op SETFLAGS without CAP_LINUX_IMMUTABLE since that's the behavior in ext*. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
|
#
a94d1d0c |
|
08-May-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation [BUG] The following script can cause unexpected fsync failure: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs mkfs.btrfs -f $dev -b 512M > /dev/null mount $dev $mnt -o nospace_cache # Prealloc one extent xfs_io -f -c "falloc 8k 64m" $mnt/file1 # Fill the remaining data space xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding sync # Write into the prealloc extent xfs_io -c "pwrite 1m 16m" $mnt/file1 # Reflink then fsync, fsync would fail due to ENOSPC xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 umount $dev The fsync fails with ENOSPC, and the last page of the buffered write is lost. [CAUSE] This is caused by: - Btrfs' back reference only has extent level granularity So write into shared extent must be COWed even only part of the extent is shared. So for above script we have: - fallocate Create a preallocated extent where we can do NOCOW write. - fill all the remaining data and unallocated space - buffered write into preallocated space As we have not enough space available for data and the extent is not shared (yet) we fall into NOCOW mode. - reflink Now part of the large preallocated extent is shared, later write into that extent must be COWed. - fsync triggers writeback But now the extent is shared and therefore we must fallback into COW mode, which fails with ENOSPC since there's not enough space to allocate data extents. [WORKAROUND] The workaround is to ensure any buffered write in the related extents (not just the reflink source range) get flushed before reflink/dedupe, so that NOCOW writes succeed that happened before reflinking succeed. The workaround is expensive, we could do it better by only flushing NOCOW range, but that needs extra accounting for NOCOW range. For now, fix the possible data loss first. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
46008d9d |
|
26-May-2019 |
Amir Goldstein <amir73il@gmail.com> |
btrfs: call fsnotify_rmdir() hook This will allow generating fsnotify delete events after the fsnotify_nameremove() hook is removed from d_delete(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
3763771c |
|
12-Jun-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix failure to persist compression property xattr deletion on fsync After the recent series of cleanups in the properties and xattrs modules that landed in the 5.2 merge window, we ended up with a regression where after deleting the compression xattr property through the setflags ioctl, we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore. As a consequence, if the inode was fsync'ed when it had the compression property set, after deleting the compression property through the setflags ioctl and fsync'ing again the inode, the log will still contain the compression xattr, because the inode did not had that bit set, which made the fsync not delete all xattrs from the log and copy all xattrs from the subvolume tree to the log tree. This regression happens due to the fact that that series of cleanups made btrfs_set_prop() call the old function do_setxattr() (which is now named btrfs_setxattr()), and not the old version of btrfs_setxattr(), which is now called btrfs_setxattr_trans(). Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current btrfs_setxattr() function and remove it from everywhere else, including its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar regressions in the future, and centralizes the setup of the bit. After all, the need to setup this bit should only be in the xattrs module, since it is an implementation of xattrs. Fixes: 04e6863b19c722 ("btrfs: split btrfs_setxattr calls regarding transaction") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
44e5194b |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop local copy of inode i_mode There isn't real use of making struct inode::i_mode a local copy, it saves a dereference one time, not much. Just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3c8d8b63 |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop old_fsflags in btrfs_ioctl_setflags btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only once. Instead used it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d2b8fcfe |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: modify local copy of btrfs_inode flags Instead of updating the binode::flags directly, update a local copy, and then at the point of no error, store copy it to the binode::flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11d3cd5c |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop useless inode i_flags copy and restore The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the inode::i_flags update functions such as btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called in btrfs_ioctl_setflags() instead of btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the inode::i_flags remains unmodified until the thread has checked all the conditions. So drop the saved inode::i_flags in out_i_flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff9fef55 |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: start transaction in btrfs_ioctl_setflags() Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This flags also includes compression attribute for which we would set/reset the compression extended attribute. While doing this there is a bit of duplicate code, the following things happens twice: - start/end_transaction - inode_inc_iversion() - current_time update to inode->i_ctime - and btrfs_update_inode() These are updated both at btrfs_ioctl_setflags() and btrfs_set_props() as well. This patch merges these two duplicate codes at btrfs_ioctl_setflags(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f22125e5 |
|
20-Apr-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_set_props to validate externally In preparation to merge multiple transactions when setting the compression flags, split btrfs_set_props() validation part outside of it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
62d54f3a |
|
22-Apr-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between send and deduplication that lead to failures and crashes Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expection is due to two different reasons: 1) When it was introduced, no operations were allowed to modifiy read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currenly being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82fa113f |
|
04-Apr-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7984ae52 |
|
25-Feb-2019 |
Goldwyn Rodrigues <rgoldwyn@suse.com> |
btrfs: Perform locking/unlocking in btrfs_remap_file_range() Move code to make it more readable, so as locking and unlocking is done in the same function. The generic checks that are now performed in the locked section are unaffected. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
262c96a3 |
|
28-Feb-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_set_prop and add btrfs_set_prop_trans btrfs_set_prop() takes transaction pointer as the first argument, however in ioctl.c for the purpose of setting the compression property, we call btrfs_set_prop() with NULL transaction pointer. Down in the call chain btrfs_setxattr() starts transaction to update the attribute and also to update the inode. So for clarity, create btrfs_set_prop_trans() with no transaction pointer as argument, in preparation to start transaction here instead of doing it down the call chain at btrfs_setxattr(). Also now the btrfs_set_prop() is a static function. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7715da84 |
|
28-Feb-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: merge _btrfs_set_prop helpers btrfs_set_prop() is a redirect to __btrfs_set_prop() with the transaction handle equal to NULL. __btrfs_set_prop() in turn passes this to do_setxattr() which then transaction is actually created. Instead merge __btrfs_set_prop() to btrfs_set_prop(), and update the caller with NULL argument. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f35f06c3 |
|
26-Mar-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not allow trimming when a fs is mounted with the nologreplay option Whan a filesystem is mounted with the nologreplay mount option, which requires it to be mounted in RO mode as well, we can not allow discard on free space inside block groups, because log trees refer to extents that are not pinned in a block group's free space cache (pinning the extents is precisely the first phase of replaying a log tree). So do not allow the fitrim ioctl to do anything when the filesystem is mounted with the nologreplay option, because later it can be mounted RW without that option, which causes log replay to happen and result in either a failure to replay the log trees (leading to a mount failure), a crash or some silent corruption. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Fixes: 96da09192cda ("btrfs: Introduce new mount option to disable tree log replay") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4ea748e1 |
|
25-Feb-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock between clone/dedupe and rename Reflinking (clone/dedupe) and rename are operations that operate on two inodes and therefore need to lock them in the same order to avoid ABBA deadlocks. It happens that Btrfs' reflink implementation always locked them in a different order from VFS's lock_two_nondirectories() helper, which is used by the rename code in VFS, resulting in ABBA type deadlocks. Btrfs' locking order: static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2) { if (inode1 < inode2) swap(inode1, inode2); inode_lock_nested(inode1, I_MUTEX_PARENT); inode_lock_nested(inode2, I_MUTEX_CHILD); } VFS's locking order: void lock_two_nondirectories(struct inode *inode1, struct inode *inode2) { if (inode1 > inode2) swap(inode1, inode2); if (inode1 && !S_ISDIR(inode1->i_mode)) inode_lock(inode1); if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1) inode_lock_nested(inode2, I_MUTEX_NONDIR2); } Fix this by killing the btrfs helper function that does the double inode locking and replace it with VFS's helper lock_two_nondirectories(). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 416161db9b63e3 ("btrfs: offline dedupe") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
57a50e25 |
|
12-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove no longer needed range length checks for deduplication Comparing the content of the pages in the range to deduplicate is now done in generic_remap_checks called by the generic helper generic_remap_file_range_prep(), which takes care of ensuring we do not compare/deduplicate undefined data beyond a file's EOF (range from EOF to the next block boundary). So remove these checks which are now redundant. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
09ba3bc9 |
|
18-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: merge btrfs_find_device and find_device Both btrfs_find_device() and find_device() does the same thing except that the latter does not take the seed device onto account in the device scanning context. We can merge them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4319cd9 |
|
17-Jan-2019 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_find_device() take fs_devices as argument btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info. Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
500710d3 |
|
12-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: move duplicated nodatasum check into common reflink/dedupe helper Move the check that verifies if both inodes have checksums disabled or both have them enabled, from the clone and deduplication functions into the new common helper btrfs_remap_file_range_prep(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d00c2d9c |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not overwrite error return value in the balance ioctl If the call to btrfs_balance() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_balance() returned success or was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3a53286 |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not overwrite error return value in the device replace ioctl If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if no error happened before or a device replace operation was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0f39b605 |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: remove redundant check for swapfiles when reflinking Checking if either of the inodes corresponds to a swapfile is already performed by generic_remap_file_range_prep(), so we do not need to do it in the btrfs clone and deduplication functions. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
aa704d4e |
|
14-Dec-2018 |
YueHaibing <yuehaibing@huawei.com> |
btrfs: remove set but not used variable 'num_pages' Fixes gcc '-Wunused-but-set-variable' warning: fs/btrfs/ioctl.c: In function 'btrfs_extent_same': fs/btrfs/ioctl.c:3260:6: warning: variable 'num_pages' set but not used [-Wunused-but-set-variable] It not used any more since commit 9ee8234e6220 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eee99577 |
|
14-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not overwrite error return value in the get device stats ioctl If the call to btrfs_get_dev_stats() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_get_dev_stats() returned success. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4fa99b00 |
|
14-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not overwrite error return value in scrub progress ioctl If the call to btrfs_scrub_progress() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_scrub_progress() returned success. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
06fe39ab |
|
14-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: do not overwrite scrub error with fault error in scrub ioctl If scrub returned an error and then the copy_to_user() call did not succeed, we would overwrite the error returned by scrub with -EFAULT. Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned success. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d8b55242 |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between reflink/dedupe and relocation The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between relocation and reflinking (for both cloning and deduplication) the file extents between the source and destination inodes. This happens because we no longer lock the source range anymore, and we do not lock it anymore because we wait for direct IO writes and writeback to complete early on the code path right after locking the inodes, which guarantees no other file operations interfere with the reflinking. However there is one exception which is relocation, since it replaces the byte number of file extents items in the fs tree after locking the range the file extent items represent. This is a problem because after finding each file extent to clone in the fs tree, the reflink process copies the file extent item into a local buffer, releases the search path, inserts new file extent items in the destination range and then increments the reference count for the extent mentioned in the file extent item that it previously copied to the buffer. If right after copying the file extent item into the buffer and releasing the path the relocation process updates the file extent item to point to the new extent, the reflink process ends up creating a delayed reference to increment the reference count of the old extent, for which the relocation process already created a delayed reference to drop it. This results in failure to run delayed references because we will attempt to increment the count of a reference that was already dropped. This is illustrated by the following diagram: CPU 1 CPU 2 relocation is running btrfs_clone_files() btrfs_clone() --> finds extent item in source range point to extent at bytenr X --> copies it into a local buffer --> releases path replace_file_extents() --> successfully locks the range represented by the file extent item --> replaces disk_bytenr field in the file extent item with some other value Y --> creates delayed reference to increment reference count for extent at bytenr Y --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by locking the source range before searching for the file extent items in the fs tree, since the relocation process will try to lock the range a file extent item represents before updating it with the new extent location. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f7fa1107 |
|
08-Jan-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix race between cloning range ending at eof and writeback The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between writeback and cloning a range that covers the eof extent of the source file into a destination offset that is greater then the same file's size. This happens because we now wait for writeback to complete before doing the truncation of the eof block, while previously we did the truncation and then waited for writeback to complete. This leads to a race between writeback of the truncated block and cloning the file extents in the source range, because we copy each file extent item we find in the fs root into a buffer, then release the path and then increment the reference count for the extent referred in that file extent item we copied, which can no longer exist if writeback of the truncated eof block completes after we copied the file extent item into the buffer and before we incremented the reference count. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_clone_files() btrfs_cont_expand() btrfs_truncate_block() --> zeroes part of the page containg eof, marking it for delalloc btrfs_clone() --> finds extent item covering eof, points to extent at bytenr X --> copies it into a local buffer --> releases path writeback starts btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by waiting for writeback to complete after truncating the eof block. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34a28e3d |
|
07-Dec-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: use generic_remap_file_range_prep() for cloning and deduplication Since cloning and deduplication are no longer Btrfs specific operations, we now have generic code to handle parameter validation, compare file ranges used for deduplication, clear capabilities when cloning, etc. This change makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a few bugs, such as: 1) When cloning, the destination file's capabilities were not dropped (the fstest generic/513 tests this); 2) We were not checking if the destination file is immutable; 3) Not checking if either the source or destination files are swap files (swap file support is coming soon for Btrfs); 4) System limits were not checked (resource limits and O_LARGEFILE). Note that the generic helper generic_remap_file_range_prep() does start and waits for writeback by calling filemap_write_and_wait_range(), however that is not enough for Btrfs for two reasons: 1) With compression, we need to start writeback twice in order to get the pages marked for writeback and ordered extents created; 2) filemap_write_and_wait_range() (and all its other variants) only waits for the IO to complete, but we need to wait for the ordered extents to finish, so that when we do the actual reflinking operations the file extent items are in the fs tree. This is also important due to the fact that the generic helper, for the deduplication case, compares the contents of the pages in the requested range, which might require reading extents from disk in the very unlikely case that pages get invalidated after writeback finishes (so the file extent items must be up to date in the fs tree). Since these reasons are specific to Btrfs we have to do it in the Btrfs code before calling generic_remap_file_range_prep(). This also results in a simpler way of dealing with existing delalloc in the source/target ranges, specially for the deduplication case where we used to lock all the pages first and then if we found any dealloc for the range, or ordered extent, we would unlock the pages trigger writeback and wait for ordered extents to complete, then lock all the pages again and check if deduplication can be done. So now we get a simpler approach: lock the inodes, then trigger writeback and then wait for ordered extents to complete. So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it) to eliminate duplicated code, fix a few bugs and benefit from future bug fixes done there - for example the recent clone and dedupe bugs involving reflinking a partial EOF block got a counterpart fix in the generic helper, since it affected all filesystems supporting these operations, so we no longer need special checks in Btrfs for them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de37aa51 |
|
30-Oct-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove fsid/metadata_fsid fields from btrfs_info Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cd24c69 |
|
01-Nov-2018 |
Ethan Lien <ethanlien@synology.com> |
btrfs: use tagged writepage to mitigate livelock of snapshot Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eede2bf3 |
|
03-Nov-2016 |
Omar Sandoval <osandov@fb.com> |
Btrfs: prevent ioctls from interfering with a swap file A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac765f83 |
|
05-Nov-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix data corruption due to cloning of eof block We currently allow cloning a range from a file which includes the last block of the file even if the file's size is not aligned to the block size. This is fine and useful when the destination file has the same size, but when it does not and the range ends somewhere in the middle of the destination file, it leads to corruption because the bytes between the EOF and the end of the block have undefined data (when there is support for discard/trimming they have a value of 0x00). Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ export foo_size=$((256 * 1024 + 100)) $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar $ od -A d -t x1 /mnt/bar 0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 * 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c * 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 * 1048576 The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527 (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead of 0xb5. This is similar to the problem we had for deduplication that got recently fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when deduplicating between different files"). Fix this by not allowing such operations to be performed and return the errno -EINVAL to user space. This is what XFS is doing as well at the VFS level. This change however now makes us return -EINVAL instead of -EOPNOTSUPP for cases where the source range maps to an inline extent and the destination range's end is smaller then the destination file's size, since the detection of inline extents is done during the actual process of dropping file extent items (at __btrfs_drop_extents()). Returning the -EINVAL error is done early on and solely based on the input parameters (offsets and length) and destination file's size. This makes us consistent with XFS and anyone else supporting cloning since this case is now checked at a higher level in the VFS and is where the -EINVAL will be returned from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1 by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into partial EOF block"). So this change is more geared towards stable kernels, as it's unlikely the new VFS checks get removed intentionally. A test case for fstests follows soon, as well as an update to filter existing tests that expect -EOPNOTSUPP to accept -EINVAL as well. CC: <stable@vger.kernel.org> # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11023d3f |
|
05-Nov-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix infinite loop on inode eviction after deduplication of eof block If we attempt to deduplicate the last block of a file A into the middle of a file B, and file A's size is not a multiple of the block size, we end rounding the deduplication length to 0 bytes, to avoid the data corruption issue fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when deduplicating between different files"). However a length of zero will cause the insertion of an extent state with a start value greater (by 1) then the end value, leading to a corrupt extent state that will trigger a warning and cause chaos such as an infinite loop during inode eviction. Example trace: [96049.833585] ------------[ cut here ]------------ [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs] [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1 [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs] [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282 [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006 [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0 [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000 [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8 [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78 [96049.833790] FS: 00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000 [96049.833791] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0 [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [96049.833796] Call Trace: [96049.833809] __set_extent_bit+0x46c/0x6a0 [btrfs] [96049.833823] lock_extent_bits+0x6b/0x210 [btrfs] [96049.833831] ? _raw_spin_unlock+0x24/0x30 [96049.833841] ? test_range_bit+0xdf/0x130 [btrfs] [96049.833853] lock_extent_range+0x8e/0x150 [btrfs] [96049.833864] btrfs_double_extent_lock+0x78/0xb0 [btrfs] [96049.833875] btrfs_extent_same_range+0x14e/0x550 [btrfs] [96049.833885] ? rcu_read_lock_sched_held+0x3f/0x70 [96049.833890] ? __kmalloc_node+0x2b0/0x2f0 [96049.833899] ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs] [96049.833909] btrfs_dedupe_file_range+0x270/0x280 [btrfs] [96049.833916] vfs_dedupe_file_range_one+0xd9/0xe0 [96049.833919] vfs_dedupe_file_range+0x131/0x1b0 [96049.833924] do_vfs_ioctl+0x272/0x6e0 [96049.833927] ? __fget+0x113/0x200 [96049.833931] ksys_ioctl+0x70/0x80 [96049.833933] __x64_sys_ioctl+0x16/0x20 [96049.833937] do_syscall_64+0x60/0x1b0 [96049.833939] entry_SYSCALL_64_after_hwframe+0x49/0xbe [96049.833941] RIP: 0033:0x7f5c1478ddd7 [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7 [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004 [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040 [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000 [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0 [96049.833954] irq event stamp: 6196 [96049.833956] hardirqs last enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640 [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [96049.833959] softirqs last enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421 [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0 [96049.833965] ---[ end trace db7b05f01b7fa10c ]--- [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910 [96049.935822] irq event stamp: 6584 [96049.935823] hardirqs last enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640 [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [96049.935827] softirqs last enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421 [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0 [96049.935829] ---[ end trace db7b05f01b7fa123 ]--- [96049.935840] ------------[ cut here ]------------ [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs] [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G W 4.19.0-rc7-btrfs-next-39 #1 [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs] [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282 [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006 [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0 [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000 [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8 [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48 [96049.936125] FS: 00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000 [96049.936126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0 [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [96049.936131] Call Trace: [96049.936141] __set_extent_bit+0x46c/0x6a0 [btrfs] [96049.936154] lock_extent_bits+0x6b/0x210 [btrfs] [96049.936167] btrfs_evict_inode+0x1e1/0x5a0 [btrfs] [96049.936172] evict+0xbf/0x1c0 [96049.936174] dispose_list+0x51/0x80 [96049.936176] evict_inodes+0x193/0x1c0 [96049.936180] generic_shutdown_super+0x3f/0x110 [96049.936182] kill_anon_super+0xe/0x30 [96049.936189] btrfs_kill_super+0x13/0x100 [btrfs] [96049.936191] deactivate_locked_super+0x3a/0x70 [96049.936193] cleanup_mnt+0x3b/0x80 [96049.936195] task_work_run+0x93/0xc0 [96049.936198] exit_to_usermode_loop+0xfa/0x100 [96049.936201] do_syscall_64+0x17f/0x1b0 [96049.936202] entry_SYSCALL_64_after_hwframe+0x49/0xbe [96049.936204] RIP: 0033:0x7fe4b80cfb37 [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37 [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0 [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015 [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910 [96049.936216] irq event stamp: 6616 [96049.936219] hardirqs last enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640 [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [96049.936222] softirqs last enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421 [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0 [96049.936223] ---[ end trace db7b05f01b7fa124 ]--- The second stack trace, from inode eviction, is repeated forever due to the infinite loop during eviction. This is the same type of problem fixed way back in 2015 by commit 113e8283869b ("Btrfs: fix inode eviction infinite loop after extent_same ioctl") and commit ccccf3d67294 ("Btrfs: fix inode eviction infinite loop after cloning into it"). So fix this by returning immediately if the deduplication range length gets rounded down to 0 bytes, as there is nothing that needs to be done in such case. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar # Unmount the filesystem and mount it again so that we start without any # extent state records when we ask for the deduplication. $ umount /mnt $ mount /dev/sdb /mnt $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar # This unmount triggers the infinite loop. $ umount /mnt A test case for fstests will follow soon. Fixes: de02b9f6bb65 ("Btrfs: fix data corruption when deduplicating between different files") CC: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42ec3d4c |
|
29-Oct-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: make remap_file_range functions take and return bytes completed Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2e5dfc99 |
|
29-Oct-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
vfs: combine the clone and dedupe into a single remap_file_range Combine the clone_file_range and dedupe_file_range operations into a single remap_file_range file operation dispatch since they're fundamentally the same operation. The differences between the two can be made in the prep functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
6ba9fc8e |
|
07-Sep-2018 |
Qu Wenruo <wqu@suse.com> |
btrfs: Ensure btrfs_trim_fs can trim the whole filesystem [BUG] fstrim on some btrfs only trims the unallocated space, not trimming any space in existing block groups. [CAUSE] Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to range [0, super->total_bytes). So later btrfs_trim_fs() will only be able to trim block groups in range [0, super->total_bytes). While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs uses its logical address space, there is nothing limiting the location where we put block groups. For filesystem with frequent balance, it's quite easy to relocate all block groups and bytenr of block groups will start beyond super->total_bytes. In that case, btrfs will not trim existing block groups. [FIX] Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() can get the unmodified range, which is normally set to [0, U64_MAX]. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl") CC: <stable@vger.kernel.org> # v4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
28c4a3e2 |
|
04-Sep-2018 |
Su Yue <suy.fnst@cn.fujitsu.com> |
btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag Since commit 8b62f87bad9c ("Btrfs: rework outstanding_extents"), manual operations of outstanding_extent in btrfs_inode are replaced by btrfs_mod_outstanding_extents(). The one in cluster_pages_for_defrag seems to be lost, so replace it of btrfs_mod_outstanding_extents(). Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents") Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4fd786e6 |
|
05-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Remove 'objectid' member from struct btrfs_root There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
684572df |
|
04-Aug-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove root parameter from btrfs_insert_dir_item All callers pass the root tree of dir, we can push that down to the function itself. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
de02b9f6 |
|
17-Aug-2018 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix data corruption when deduplicating between different files If we deduplicate extents between two different files we can end up corrupting data if the source range ends at the size of the source file, the source file's size is not aligned to the filesystem's block size and the destination range does not go past the size of the destination file size. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo # The first byte with a value of 0xae starts at an offset (2518890) # which is not a multiple of the sector size. $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo # Confirm the file content is full of bytes with values 0x6b and 0xae. $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae 11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # Create a second file with a length not aligned to the sector size, # whose bytes all have the value 0x6b, so that its extent(s) can be # deduplicated with the first file. $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar # Now deduplicate the entire second file into a range of the first file # that also has all bytes with the value 0x6b. The destination range's # end offset must not be aligned to the sector size and must be less # then the offset of the first byte with the value 0xae (byte at offset # 2518890). $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo # The bytes in the range starting at offset 2515659 (end of the # deduplication range) and ending at offset 2519040 (start offset # rounded up to the block size) must all have the value 0xae (and not # replaced with 0x00 values). In other words, we should have exactly # the same data we had before we asked for deduplication. $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae 11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # Unmount the filesystem and mount it again. This guarantees any file # data in the page cache is dropped. $ umount /dev/sdb $ mount /dev/sdb /mnt $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a # value of 0xae, data corruption happened due to the deduplication # operation. So fix this by rounding down, to the sector size, the length used for the deduplication when the following conditions are met: 1) Source file's range ends at its i_size; 2) Source file's i_size is not aligned to the sector size; 3) Destination range does not cross the i_size of the destination file. Fixes: e1d227a42ea2 ("btrfs: Handle unaligned length in extent_same") CC: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8ecebf4d |
|
05-Aug-2018 |
Robbie Ko <robbieko@synology.com> |
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
672d5990 |
|
02-Aug-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Use wrapper macro for rcu string to remove duplicate code Cleanup patch and no functional changes. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6025c19f |
|
31-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info from btrfs_add_root_ref It can be referenced from the passed transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
616d374e |
|
17-Jul-2018 |
Adam Borowski <kilobyte@angband.pl> |
btrfs: allow defrag on a file opened read-only that has rw permissions Requiring a read-write descriptor conflicts both ways with exec, returning ETXTBSY whenever you try to defrag a program that's currently being run, or causing intermittent exec failures on a live system being defragged. As defrag doesn't change the file's contents in any way, there's no reason to consider it a rw operation. Thus, let's check only whether the file could have been opened rw. Such access control is still needed as currently defrag can use extra disk space, and might trigger bugs. We return EINVAL when the request is invalid; here it's ok but merely the user has insufficient privileges. Thus, the EPERM return value reflects the error better -- as discussed in the identical case for dedupe. According to codesearch.debian.net, no userspace program distinguishes these values beyond strerror(). Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> [ fold the EPERM patch from Adam ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a9377422 |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
280f8bd2 |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f0042d5e |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3efbee1d |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49a05ecd |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
39616c27 |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9f8a6ce6 |
|
18-Jul-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
340f1aa2 |
|
05-Jul-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable") not only resulted in an easier to follow code but it also introduced a subtle bug. It changed the timing when the initial transaction rescan was happening: - before the commit: it would happen after transaction commit had occured - after the commit: it might happen before the transaction was committed This results in failure to correctly rescan the quota since there could be data which is still not committed on disk. This patch aims to fix this by moving the transaction creation/commit inside btrfs_quota_enable, which allows to schedule the quota commit after the transaction has been committed. Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable") Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Link: https://marc.info/?l=linux-btrfs&m=152999289017582 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d7f663fa |
|
29-Jun-2018 |
David Sterba <dsterba@suse.com> |
btrfs: prune unused includes Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc877d28 |
|
18-Jun-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Deduplicate extent_buffer init code When a new extent buffer is allocated there are a few mandatory fields which need to be set in order for the buffer to be sane: level, generation, bytenr, backref_rev, owner and FSID/UUID. Currently this is open coded in the callers of btrfs_alloc_tree_block, meaning it's fairly high in the abstraction hierarchy of operations. This patch solves this by simply moving this init code in btrfs_init_new_buffer, since this is the function which initializes a newly allocated extent buffer. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bece2e82 |
|
20-Jun-2018 |
Bart Van Assche <bvanassche@acm.org> |
btrfs: Fix misleading indentation reported by smatch This patch avoids that building the BTRFS source code with smatch triggers complaints about inconsistent indenting. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97b19170 |
|
13-Jul-2018 |
Naohiro Aota <naota@elisp.net> |
btrfs: fix use-after-free of cmp workspace pages btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves their page address intact. Now, if you hit "goto again" in btrfs_extent_same_range() and hit some error in btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages. This is simple fix to reset the address to avoid use-after-free. Fixes: 67b07bd4bec5 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl") Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
87eb5eb2 |
|
06-Jul-2018 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: dedupe: rationalize args Clean up f_op->dedupe_file_range() interface. 1) Use loff_t for offsets and length instead of u64 2) Order the arguments the same way as {copy|clone}_file_range(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
5740c99e |
|
06-Jul-2018 |
Miklos Szeredi <mszeredi@redhat.com> |
vfs: dedupe: return int Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
#
22883ddc |
|
19-Jun-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: fix invalid-free in btrfs_extent_same If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) is hit, we will go to free the uninitialized cmp.src_pages and cmp.dst_pages. Fixes: 67b07bd4bec5 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl") Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
95582b00 |
|
08-May-2018 |
Deepa Dinamani <deepa.kernel@gmail.com> |
vfs: change inode times to use struct timespec64 struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following cocinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my usecase. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct \( iattr \| inode \| kstat \) { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (*update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec *t, + struct timespec64 *t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec *t + struct timespec64 *t ) { ... } @te depends on patch forall@ identifier ts; local idexpression struct inode *inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode *node1; local idexpression struct inode *node2; local idexpression struct iattr *attr1; local idexpression struct iattr *attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; | - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) | - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) | - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) | - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) | ts = current_time(e) | fn_update_time(..., &ts,...) | inode_node->i_xtime = ts | node1->i_xtime = ts | ts = inode_node->i_xtime | <+... attr1->ia_xtime ...+> = ts | ts = attr1->ia_xtime | ts.tv_sec | ts.tv_nsec | btrfs_set_stack_timespec_sec(..., ts.tv_sec) | btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) | - ts = timespec64_to_timespec( + ts = ... -) | - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) | - ts = E3 + ts = timespec_to_timespec64(E3) | - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) | fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) | - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) | - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) | - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) | node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) | - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) | - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) | - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode *node; struct iattr *attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); | fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); | - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode *node; struct iattr *attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode *node; struct iattr *attr; struct kstat *stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); | + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode *node; struct inode *node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr *attrp; struct iattr *attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat *stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ; | node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | stat->xtime = node2->i_xtime1; | stat1.xtime = node2->i_xtime1; | ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; | ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2; | - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); | - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); | node->i_xtime1 = current_time(...); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>
|
#
3ca57bd6 |
|
04-Jun-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user The patch introducing the ioctl was not the latest version at the time of merging to the mainline and needs a fixup from this patch. Fixes: ba637a252d30 ("btrfs: Check error of btrfs_iget() in btrfs_search_path_in_tree_user") Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23d0b79d |
|
20-May-2018 |
Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> |
btrfs: Add unprivileged version of ino_lookup ioctl Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to allow normal users to call "btrfs subvolume list/show" etc. in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF. This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different. This is because it always searches the fs/file tree correspoinding to the fd with which this ioctl is called and also returns the name of bottom subvolume. The main differences from original ino_lookup ioctl are: 1. Read + Exec permission will be checked using inode_permission() during path construction. -EACCES will be returned in case of failure. 2. Path construction will be stopped at the inode number which corresponds to the fd with which this ioctl is called. If constructed path does not exist under fd's inode, -EACCES will be returned. 3. The name of bottom subvolume is also searched and filled. Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1) bytes than ino_lookup ioctl because of space of subvolume's name. Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
42e4b520 |
|
20-May-2018 |
Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> |
btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF information of the subvolume containing this inode except the subvolume name (this is because to prevent potential name leak). The subvolume name will be gained by user version of ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check. The min id of root ref's subvolume to be searched is specified by @min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller can just call this ioctl again without changing the argument to continue search. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes and struct item renames ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b64ec075 |
|
20-May-2018 |
Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> |
btrfs: Add unprivileged ioctl which returns subvolume information Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the information of subvolume containing this inode. (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.) Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ minor style fixes, update struct comments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c4c129db |
|
29-May-2018 |
Gu JinXiang <gujx@cn.fujitsu.com> |
btrfs: drop unused parameter qgroup_reserved Since commit 7775c8184ec0 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d1957791 |
|
29-May-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info argument from btrfs_uuid_tree_rem This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
cdb345a8 |
|
29-May-2018 |
Lu Fengqi <lufq.fnst@cn.fujitsu.com> |
btrfs: Remove fs_info argument from btrfs_uuid_tree_add This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd4e994b |
|
22-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2() If we have invalid flags set, when we error out we must drop our writer counter and free the buffer we allocated for the arguments. This bug is trivially reproduced with the following program on 4.7+: #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/btrfs.h> #include <linux/btrfs_tree.h> int main(int argc, char **argv) { struct btrfs_ioctl_vol_args_v2 vol_args = { .flags = UINT64_MAX, }; int ret; int fd; if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } fd = open(argv[1], O_WRONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args); if (ret == -1) perror("ioctl"); close(fd); return EXIT_SUCCESS; } When unmounting the filesystem, we'll hit the WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the filesystem to be remounted read-only as the writer count will stay lifted. Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b5c40d59 |
|
22-May-2018 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix clone vs chattr NODATASUM race In btrfs_clone_files(), we must check the NODATASUM flag while the inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags() will change the flags after we check and we can end up with a party checksummed file. The race window is only a few instructions in size, between the if and the locks which is: 3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) 3835 return -EISDIR; where the setflags must be run and toggle the NODATASUM flag (provided the file size is 0). The clone will block on the inode lock, segflags takes the inode lock, changes flags, releases log and clone continues. Not impossible but still needs a lot of bad luck to hit unintentionally. Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ad1e3d56 |
|
20-May-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctl btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or ERR_PTR(-ENOMEM) and therefore search_ioctl() and btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT, which all other callers of btrfs_read_fs_root_no_name() do. Drop the error message as it would be confusing, the caller of ioctl will likely interpret the error code and not look into the syslog. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bf5091c8 |
|
11-May-2018 |
David Sterba <dsterba@suse.com> |
btrfs: use kvzalloc for EXTENT_SAME temporary data The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the arrays can be 32KiB large. To avoid allocation failures due to fragmented memory, use the allocation with fallback to vmalloc. The arrays are allocated and freed only inside btrfs_extent_same and reused for all the ranges. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
67b07bd4 |
|
01-May-2018 |
Timofey Titovets <nefelim4ag@gmail.com> |
Btrfs: reuse cmp workspace in EXTENT_SAME ioctl We support big dedup requests by splitting range to smaller parts, and call dedupe logic on each of them. Instead of repeated allocation and deallocation, allocate once at the beginning and reuse in the iteration. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b6728768 |
|
01-May-2018 |
Timofey Titovets <nefelim4ag@gmail.com> |
Btrfs: dedupe_file_range ioctl: remove 16MiB restriction Currently btrfs_dedupe_file_range silently restricts the dedupe range to to 16MiB to limit locking and working memory size and is documented in manual page as implementation specific. Let's remove that restriction by iterating over the dedup range in 16MiB steps. This is backward compatible and will not change anything for requests smaller then 16MiB. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3973909d |
|
01-May-2018 |
Timofey Titovets <nefelim4ag@gmail.com> |
Btrfs: split btrfs_extent_same Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME entry and a helper that can be repeatedly called on a range. This will be used in following patches. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c57b8b6 |
|
23-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: unify naming of flags variables for SETFLAGS and XFLAGS * The simple 'flags' refer to the btrfs inode * ... that's in 'binode * the FS_*_FL variables are 'fsflags' * the old copies of the variable are prefixed by 'old_' * Struct inode flags contain 'i_flags'. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
025f2121 |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add FS_IOC_FSSETXATTR ioctl The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new flags and is extensible. Don't get fooled by the XATTR in the name, it does not have anything in common with the extended attributes, incidentally also abbreviated as XATTRs. This patch allows to set the xflags portion of the fsxattr structure, other items have no meaning and non-zero values will result in EOPNOTSUPP. Currently supported xflags: - APPEND - IMMUTABLE - NOATIME - NODUMP - SYNC The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but is simpler on the flag setting side. The original patch was written by Chandan Jay Sharma but was incomplete and no further revision has been sent. Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4202ac9 |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add FS_IOC_FSGETXATTR ioctl The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new flags and is extensible. This patch allows to return the xflags portion of the fsxattr structure, other items have no meaning for btrfs or can be added later. The original patch was written by Chandan Jay Sharma but was incomplete and no further revision has been sent. Several cleanups were necessary to avoid confusion with other ioctls, as we have another flavor of flags. Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19f93b3c |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: add helpers for FS_XFLAG_* conversion Preparatory work for the FS_IOC_FSGETXATTR ioctl, basic conversions and checking helpers. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a157d4fd |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_flags_to_ioctl to reflect which flags it touches Converts btrfs_inode::flags to the FS_*_FL flags. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5ba76abf |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: rename check_flags to reflect which flags it touches The FS_*_FL flags cannot be easily identified by a prefix but we still need to recognize them so the 'fsflags' should be closer to the naming scheme but again the 'fs' part sounds like it's a filesystem flag. I don't have a better idea for now. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1905a0f7 |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_mask_flags to reflect which flags it touches The FS_*_FL flags cannot be easily identified by a variable name prefix but we still need to recognize them so the 'fsflags' should be closer to the naming scheme but again the 'fs' part sounds like it's a filesystem flag. I don't have a better idea for now. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7b6a221e |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_update_iflags to reflect which flags it touches The btrfs inode flag flavour is now simply called 'inode flags' and the vfs inode are i_flags. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6fcf6e2b |
|
07-May-2018 |
David Sterba <dsterba@suse.com> |
btrfs: remove redundant btrfs_balance_control::fs_info The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
76f32e24 |
|
23-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes It's always set to 0, so just remove it and collapse the constant value to the only function we are passing it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
82b3e53b |
|
23-Apr-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots This parameter was introduced alongside the function in eb73c1b7cea7 ("Btrfs: introduce per-subvolume delalloc inode list") to avoid deadlocks since this function was used in the transaction commit path. However, commit 8d875f95da43 ("btrfs: disable strict file flushes for renames and truncates") removed that usage, rendering the parameter obsolete. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
008ef096 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3009a62f |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dccdb07b |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
149196a2 |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup helpers that reset balance state The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a17c95df |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move clearing of EXCL_OP out of __cancel_balance Make the clearning visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
72b81abf |
|
20-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: move volume_mutex to callers of btrfs_rm_device Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f60a2364 |
|
17-Apr-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() Factor out the second half of btrfs_ioctl_snap_destroy() as btrfs_delete_subvolume(), which performs some subvolume specific checks before deletion: 1. send is not in progress 2. the subvolume is not the default subvolume 3. the subvolume does not contain other subvolumes and actual deletion process. btrfs_delete_subvolume() requires inode_lock for both @dir and inode of @dentry. The remaining part of btrfs_ioctl_snap_destroy() is mainly permission checks. Note that call of d_delete() is not included in btrfs_delete_subvolume() as this function will also be used by btrfs_rmdir() to delete an empty subvolume and in that case d_delete() is called in VFS layer. As a result, btrfs_unlink_subvol() and may_destroy_subvol() become static functions. No functional changes. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ec42f167 |
|
17-Apr-2018 |
Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> |
btrfs: Move may_destroy_subvol() from ioctl.c to inode.c This is a preparation work to refactor btrfs_ioctl_snap_destroy() and to allow rmdir(2) to delete an empty subvolume. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of the function comment ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c065f5b1 |
|
02-Apr-2018 |
Su Yue <suy.fnst@cn.fujitsu.com> |
btrfs: rename btrfs_get_block_group_info and make it static The function btrfs_get_block_group_info() was introduced by the commit 5af3e8cce8b7 ("Btrfs: make filesystem read-only when submitting barrier fails") which used it in disk-io.c. However, the function is only called in ioctl.c now. Its parameter type btrfs_ioctl_space_info* is only for ioctl. So, make it static and rename it to be original name get_block_group_info. No functional change. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c1d7c514 |
|
03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- sources Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
38e82de8 |
|
26-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: user proper type for btrfs_mask_flags flags All users pass a local unsigned int and not the __uXX types that are supposed to be used for userspace interfaces. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43b18595 |
|
12-Dec-2017 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Use separate meta reservation type for delalloc Before this patch, btrfs qgroup is mixing per-transcation meta rsv with preallocated meta rsv, making it quite easy to underflow qgroup meta reservation. Since we have the new qgroup meta rsv types, apply it to delalloc reservation. Now for delalloc, most of its reserved space will use META_PREALLOC qgroup rsv type. And for callers reducing outstanding extent like btrfs_finish_ordered_io(), they will convert corresponding META_PREALLOC reservation to META_PERTRANS. This is mainly due to the fact that current qgroup numbers will only be updated in btrfs_commit_transaction(), that's to say if we don't keep such placeholder reservation, we can exceed qgroup limitation. And for callers freeing outstanding extent in error handler, we will just free META_PREALLOC bytes. This behavior makes callers of btrfs_qgroup_release_meta() or btrfs_qgroup_convert_meta() to be aware of which type they are. So in this patch, btrfs_delalloc_release_metadata() and its callers get an extra parameter to info qgroup to do correct meta convert/release. The good news is, even we use the wrong type (convert or free), it won't cause obvious bug, as prealloc type is always in good shape, and the type only affects how per-trans meta is increased or not. So the worst case will be at most metadata limitation can be sometimes exceeded (no convert at all) or metadata limitation is reached too soon (no free at all). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d87ff758 |
|
12-Mar-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Handle error from btrfs_uuid_tree_rem call in _btrfs_ioctl_set_received_subvol As with every function which deals with modifying the btree btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM). Handle return error value from this function gracefully by aborting the transaction. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7a5a07a8 |
|
05-Feb-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove userspace transaction ioctls Commit 3558d4f88ec8 ("btrfs: Deprecate userspace transaction ioctls") marked the beginning of the end of userspace transaction. This commit finishes the job! There are no known users and ceph does not use the ioctl anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Sage Weil <sage@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7c829b72 |
|
07-Mar-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: add define for oldest generation Some functions can filter metadata by the generation. Add a define that will annotate such arguments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
17d202b9 |
|
12-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename __btrfs_dev_replace_cancel() Remove __ which is for the special functions. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
97282031 |
|
12-Feb-2018 |
Anand Jain <anand.jain@oracle.com> |
btrfs: open code btrfs_dev_replace_cancel() btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the actual cancel so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4625956a |
|
15-Mar-2018 |
Peter Zijlstra <peterz@infradead.org> |
sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
ae5e165d |
|
29-Jan-2018 |
Jeff Layton <jlayton@kernel.org> |
fs: new API for handling inode->i_version Add a documentation blob that explains what the i_version field is, how it is expected to work, and how it is currently implemented by various filesystems. We already have inode_inc_iversion. Add several other functions for manipulating and accessing the i_version counter. For now, the implementation is trivial and basically works the way that all of the open-coded i_version accesses work today. Future patches will convert existing users of i_version to use the new API, and then convert the backend implementation to do things more efficiently. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
|
#
6670d4c2 |
|
08-Jan-2018 |
Xiongfeng Wang <xiongfeng.wang@linaro.org> |
btrfs: use correct string length in DEV_INFO ioctl gcc-8 reports: fs/btrfs/ioctl.c: In function 'btrfs_ioctl': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified bound 1024 equals destination size [-Wstringop-truncation] We need one less byte or call strlcpy() to make it a nul-terminated string. This is done on the next line anyway, but we want to avoid the warning. Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e43bbe5e |
|
12-Dec-2017 |
David Sterba <dsterba@suse.com> |
btrfs: sink unlock_extent parameter gfp_flags All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
93370509 |
|
31-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: SETFLAGS ioctl: use helper for compression type conversion Signed-off-by: David Sterba <dsterba@suse.com>
|
#
401e29c1 |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebbede42 |
|
03-Dec-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ae0f1625 |
|
31-Oct-2017 |
David Sterba <dsterba@suse.com> |
btrfs: sink gfp parameter to clear_extent_bit All callers use GFP_NOFS, we don't have to pass it as an argument. The built-in tests pass GFP_KERNEL, but they run only at module load time and NOFS works there as well. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d03262c7 |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: switch to RCU for device traversal in btrfs_ioctl_fs_info We don't need to use the mutex as we do not modify the devices nor the list itself and just read information about device counts. Move copying fsid out of the protected section, not applicable to RCU same as the rest of the retrieved information. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5593ca3 |
|
15-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: switch to RCU for device traversal in btrfs_ioctl_dev_info We don't need to use the mutex as we do not modify the devices nor the list itself and just read some information: does not change during device lifetime: - devid - uuid - name (ie. the path) may change in parallel to the ioctl call, but can lead only to reporting inacurracy: - bytes_used - total_bytes Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2c997384 |
|
05-Nov-2017 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move volume_mutex into the btrfs_rm_device() A cleanup patch no functional change, we hold volume_mutex before calling btrfs_rm_device, so move it into the function itself. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c8bcbfbd |
|
01-Dec-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix possible off-by-one in btrfs_search_path_in_tree The name char array passed to btrfs_search_path_in_tree is of size BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes are in the range of [0, 4079]. Currently the code uses the define but this represents an off-by-one. Implications: Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be written to extra space, not some padding that could be provided by the allocator. btrfs-progs store the arguments on stack, but kernel does own copy of the ioctl buffer and the off-by-one overwrite does not affect userspace, but the ending 0 might be lost. Kernel ioctl buffer is allocated dynamically so we're overwriting somebody else's memory, and the ioctl is privileged if args.objectid is not 256. Which is in most cases, but resolving a subvolume stored in another directory will trigger that path. Before this patch the buffer was one byte larger, but then the -1 was not added. Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added implications ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1751e8a6 |
|
27-Nov-2017 |
Linus Torvalds <torvalds@linux-foundation.org> |
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8b62f87b |
|
19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: rework outstanding_extents Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents. Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1 for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1 In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0 This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b115e3bc |
|
22-Sep-2017 |
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> |
btrfs: increase output size for LOGICAL_INO_V2 ioctl Build-server workloads have hundreds of references per file after dedup. Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer. Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME). To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d24a67b2 |
|
22-Sep-2017 |
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> |
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2 Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace. Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the existing reserved[] fields. Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values. Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized. The lack of a zero check also means that new programs have no way to know whether the kernel is honoring the flags field. To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can use the same argument layout as LOGICAL_INO, but shorten the reserved[] array by one element and turn it into the 'flags' field. The V2 ioctl explicitly checks that reserved fields and unsupported flag bits are zero so that userspace can negotiate future feature bits as they are defined. Since the memory layouts of the two ioctls' arguments are compatible, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search where the layout and code are quite different). A version parameter and an 'if' statement will suffice. Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical. Motivation and background, copied from the patchset cover letter: Suppose we have a file with one extent: root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a root@tester:~# sync Split the extent by overwriting it in the middle: root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a We should now have 3 extent refs to 2 extents, with one block unreachable. The extent tree looks like: root@tester:~# btrfs-debug-tree /dev/vdc -t 2 [...] item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53 extent refs 2 gen 29 flags DATA extent data backref root 5 objectid 261 offset 0 count 2 [...] item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53 extent refs 1 gen 30 flags DATA extent data backref root 5 objectid 261 offset 8192 count 1 [...] and the ref tree looks like: root@tester:~# btrfs-debug-tree /dev/vdc -t 5 [...] item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53 extent data disk byte 1103101952 nr 73728 extent data offset 0 nr 8192 ram 73728 extent compression(none) item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53 extent data disk byte 1103175680 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression(none) item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53 extent data disk byte 1103101952 nr 73728 extent data offset 12288 nr 61440 ram 73728 extent compression(none) [...] There are two references to the same extent with different, non-overlapping byte offsets: [------------------72K extent at 1103101952----------------------] [--8K----------------|--4K unreachable----|--60K-----------------] ^ ^ | | [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--] | v [-----4K extent-----] at 1103175680 We want to find all of the references to extent bytenr 1103101952. Without the patch (and without running btrfs-debug-tree), we have to do it with 18 LOGICAL_INO calls: root@tester:~# btrfs ins log 1103101952 -P /test/ Using LOGICAL_INO inode 261 offset 0 root 5 root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode inode 261 offset 0 root 5 inode 261 offset 4096 root 5 <- same extent ref as offset 0 (offset 8192 returns empty set, not reachable) inode 261 offset 12288 root 5 inode 261 offset 16384 root 5 \ inode 261 offset 20480 root 5 | inode 261 offset 24576 root 5 | inode 261 offset 28672 root 5 | inode 261 offset 32768 root 5 | inode 261 offset 36864 root 5 \ inode 261 offset 40960 root 5 > all the same extent ref as offset 12288. inode 261 offset 45056 root 5 / More processing required in userspace inode 261 offset 49152 root 5 | to figure out these are all duplicates. inode 261 offset 53248 root 5 | inode 261 offset 57344 root 5 | inode 261 offset 61440 root 5 | inode 261 offset 65536 root 5 | inode 261 offset 69632 root 5 / In the worst case the extents are 128MB long, and we have to do 32768 iterations of the loop to find one 4K extent ref. With the patch, we just use one call to map all refs to the extent at once: root@tester:~# btrfs ins log 1103101952 -P /test/ Using LOGICAL_INO_V2 inode 261 offset 0 root 5 inode 261 offset 12288 root 5 The TREE_SEARCH ioctl allows userspace to retrieve the offset and extent bytenr fields easily once the root, inode and offset are known. This is sufficient information to build a complete map of the extent and all of its references. Userspace can use this information to make better choices to dedup or defrag. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> [ copy background and motivation from cover letter ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c995ab3c |
|
22-Sep-2017 |
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> |
btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs. LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address). These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities). When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset. This prevents LOGICAL_INO from returning references to more than a single block. To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode: for (i = a; i < b; ++i) extent_ref_set += LOGICAL_INO(i); At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb. When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop). No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent. This removes the need for the loop, since we get all the extent refs in one call. Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents. This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired. There is no functional change in this patch. The new flag is always false. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> [ minor coding style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
84f7d8e6 |
|
29-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: pass root to various extent ref mod functions We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2351f431 |
|
27-Sep-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix send ioctl on 32bit with 64bit kernel We pass in a pointer in our send arg struct, this means the struct size doesn't match with 32bit user space and 64bit kernel space. Fix this by adding a compat mode and doing the appropriate conversion. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ move structure to the beginning, next to receive 32bit compat ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
80e03a2c |
|
07-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: do not make defrag wait on async_delalloc_pages By setting compression for a defrag task, the task will start IO at the end of defrag. After the combo of filemap_flush(), we've already made sure that dirty pages have made progress via async compress thread because the second filemap_flush() will wait for page lock, which won't be unlocked until those pages have been marked as writeback and ordered extents have been queued. And this is for per-inode defrag, it's not helpful to wait on a global %async_delalloc_pages and %nr_async_submits from fs_info. Although waiting on %nr_async_submits means that all bios are submitted down to per-device schedule IO lists, it doesn't wait for their completions, thus users still need to do fsync/sync to make sure the data is on disk. While with this change, it makes sure that pages are marked with writeback bits and will be submitted asynchronously shortly, therefore, the behavior of defrag option '-c' remains unchanged. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
efd38150 |
|
28-Sep-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Refactor transaction handling in received subvolume ioctl If btrfs_transaction_commit fails it will proceed to call cleanup_transaction, which in turn already does btrfs_abort_transaction. So let's remove the unnecessary code duplication. Also let's be explicit about handling failure of btrfs_uuid_tree_add by calling btrfs_end_transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9417ebc8 |
|
28-Sep-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Explicitly handle btrfs_update_root failure btrfs_udpate_root can fail and it aborts the transaction, the correct way to handle an aborted transaction is to explicitly end with btrfs_end_transaction. Even now the code is correct since btrfs_commit_transaction would handle an aborted transaction but this is more of an implementation detail. So let's be explicit in handling failure in btrfs_update_root. Furthermore btrfs_commit_transaction can also fail and by ignoring it's return value we could have left the in-memory copy of the root item in an inconsistent state. So capture the error value which allows us to correctly revert the RO/RW flags in case of commit failure. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
315d8e98 |
|
19-Sep-2017 |
Colin Ian King <colin.king@canonical.com> |
btrfs: make array types static const, reduces object code size Don't populate the read-only array types on the stack, instead make it static const. Makes the object code smaller by nearly 60 bytes: Before: text data bss dec hex filename 90536 6552 64 97152 17b80 fs/btrfs/ioctl.o After: text data bss dec hex filename 90414 6616 64 97094 17b46 fs/btrfs/ioctl.o Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
718dc5fa |
|
23-Aug-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix __user casting in ioctl.c Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
45bac0f3 |
|
01-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: use wait_event instead of a single function Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the same job. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69cc7151 |
|
01-Sep-2017 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: move finish_wait out of the loop If we're still going to wait after schedule(), we don't have to do finish_wait() to remove our %wait_queue_entry since prepare_to_wait() won't add the same %wait_queue_entry twice. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
78ad4ce0 |
|
08-Sep-2017 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: propagate error to btrfs_cmp_data_prepare caller btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors from gather_extent_pages(). While the pages are freed by btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then, btrfs_extent_same() try to access the already freed pages causing faults (or violates PageLocked assertion). This patch just return the error as is so that the caller stop the process. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage") Cc: <stable@vger.kernel.org> # 4.2 Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6d6d2829 |
|
12-Sep-2017 |
satoru takeuchi <satoru.takeuchi@gmail.com> |
btrfs: prevent to set invalid default subvolid `btrfs sub set-default` succeeds to set an ID which isn't corresponding to any fs/file tree. If such the bad ID is set to a filesystem, we can't mount this filesystem without specifying `subvol` or `subvolid` mount options. Fixes: 6ef5ed0d386b ("Btrfs: add ioctl and incompat flag to set the default mount subvol") Cc: <stable@vger.kernel.org> Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bea7eafd |
|
23-Aug-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO fs_info->super_copy->{node,sector}size are little-endian, but the ioctl should return the values in native endianness. Use the cached values in btrfs_fs_info instead. Found with sparse. Fixes: 80a773fbfc2d ("btrfs: retrieve more info from FS_INFO ioctl") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c59efa7e |
|
04-Aug-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2 The buffer passed to btrfs_ioctl_tree_search* functions have to be at least sizeof(struct btrfs_ioctl_search_header). If this is not the case then the ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW error with ->buf_size being set to the value passed by user space. Fix this by removing the size check and relying on search_ioctl, which already includes it and correctly sets buf_size. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23b5ec74 |
|
24-Jul-2017 |
Josef Bacik <jbacik@fb.com> |
btrfs: fix readdir deadlock with pagefault Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1e2ef46d |
|
17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: defrag: cleanup checking for compression status Signed-off-by: David Sterba <dsterba@suse.com>
|
#
eec63c65 |
|
17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: separate defrag and property compression Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called. The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug. Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b52aa8c9 |
|
17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: rename variable holding per-inode compression type This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line). We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d3c0bab5 |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove trivial wrapper btrfs_force_ra It's a simple call page_cache_sync_readahead, same arguments in the same order. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ea14b57f |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: fix spelling of snapshotting Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3558d4f8 |
|
26-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Deprecate userspace transaction ioctls Userspace transactions were introduced in commit 6bf13c0cc833 ("Btrfs: transaction ioctls") to provide semantics that Ceph's object store required. However, things have changed significantly since then, to the point where btrfs is no longer suitable as a backend for ceph and in fact it's actively advised against such usages. Considering this, there doesn't seem to be a widespread, legit use case of userspace transaction. They also clutter the file->private pointer. So to end the agony let's nuke the userspace transaction ioctls. As a first step let's give time for people to voice their objection by just WARN()ining when the userspace transaction is used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ move the warning past perm checks, keep the has-been-printed state; we're ok with just one warning over all filesystems ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0a52d108 |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: defrag: make readahead state allocation failure non-fatal All sorts of readahead errors are not considered fatal. We can continue defragmentation without it, with some potential slow down, which will last only for the current inode. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
63e727ec |
|
21-Jun-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL in btrfs_defrag_file We can safely use GFP_KERNEL, the function is called from two contexts: - ioctl handler, called directly, no locks taken - cleaner thread, running all queued defrag work, outside of any locks Signed-off-by: David Sterba <dsterba@suse.com>
|
#
47f08b96 |
|
18-Jul-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Use explicit round_down macro in btrfs resize ioctl handler No functional changes, just make the code more self-explanatory. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
19aee8de |
|
18-Jul-2017 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: btrfs_inherit_iflags() can be static btrfs_new_inode() is the only consumer move it to inode.c, from ioctl.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5c1aab1d |
|
09-Aug-2017 |
Nick Terrell <terrelln@fb.com> |
btrfs: Add zstd support Add zstd compression and decompression support to BtrFS. zstd at its fastest level compresses almost as well as zlib, while offering much faster compression and decompression, approaching lzo speeds. I benchmarked btrfs with zstd compression against no compression, lzo compression, and zlib compression. I benchmarked two scenarios. Copying a set of files to btrfs, and then reading the files. Copying a tarball to btrfs, extracting it to btrfs, and then reading the extracted files. After every operation, I call `sync` and include the sync time. Between every pair of operations I unmount and remount the filesystem to avoid caching. The benchmark files can be found in the upstream zstd source repository under `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}` [1] [2]. I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. The first compression benchmark is copying 10 copies of the unzipped Silesia corpus [3] into a BtrFS filesystem mounted with `-o compress-force=Method`. The decompression benchmark times how long it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is measured by comparing the output of `df` and `du`. See the benchmark file [1] for details. I benchmarked multiple zstd compression levels, although the patch uses zstd level 1. | Method | Ratio | Compression MB/s | Decompression speed | |---------|-------|------------------|---------------------| | None | 0.99 | 504 | 686 | | lzo | 1.66 | 398 | 442 | | zlib | 2.58 | 65 | 241 | | zstd 1 | 2.57 | 260 | 383 | | zstd 3 | 2.71 | 174 | 408 | | zstd 6 | 2.87 | 70 | 398 | | zstd 9 | 2.92 | 43 | 406 | | zstd 12 | 2.93 | 21 | 408 | | zstd 15 | 3.01 | 11 | 354 | The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it measures the compression ratio, extracts the tar, and deletes the tar. Then it measures the compression ratio again, and `tar`s the extracted files into `/dev/null`. See the benchmark file [2] for details. | Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) | |--------|-----------|---------------|----------|------------|----------| | None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 | | lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 | | zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 | | zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 | [1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh [3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia [4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz zstd source repository: https://github.com/facebook/zstd Signed-off-by: Nick Terrell <terrelln@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
bc98a42c |
|
17-Jul-2017 |
David Howells <dhowells@redhat.com> |
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
6374e57a |
|
23-Jun-2017 |
Chris Mason <clm@fb.com> |
btrfs: fix integer overflow in calc_reclaim_items_nr Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af244 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
bc42bda2 |
|
27-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
364ecf36 |
|
27-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f54de068 |
|
31-May-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL in init_ipath Now that init_ipath is called either from a safe context or with memalloc_nofs protection, we can switch to GFP_KERNEL allocations in init_path and init_data_container. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b349dfe |
|
07-May-2017 |
Daichou <tommy0705c@gmail.com> |
Btrfs: remove obsolete FIXMEs in qgroup ioctls These FIXMEs were already addressed in 2013. All functions check for qgroup existence: * btrfs_add_qgroup_relation * btrfs_ioctl_qgroup_create * btrfs_limit_qgroup * btrfs_del_qgroup_relation Signed-off-by: Daichou <tommy0705c@gmail.com> [ enhance and reformat changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
752ade68 |
|
08-May-2017 |
Michal Hocko <mhocko@suse.com> |
treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
171938e5 |
|
28-Mar-2017 |
David Sterba <dsterba@suse.com> |
btrfs: track exclusive filesystem operation in flags There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces: * atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...) * atomic_set(..., 0) -> clear_bit(FLAG, ...) Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
52f75f4f |
|
14-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: constify name of subvolume in creation helpers Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc4f21b1 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make get_extent_t take btrfs_inode In addition to changing the signature, this patch also switches all the functions which are used as an argument to also take btrfs_inode. Namely those are: btrfs_get_extent and btrfs_get_extent_filemap. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a2f392e4 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make clone_update_extent_map take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9cdc5124 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
dcdbc059 |
|
20-Feb-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_drop_extent_cache take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ef06d27 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_i_size_write take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
877574e2 |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_set_inode_index take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8e7611cf |
|
20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_insert_dir_item take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b1517622 |
|
21-Feb-2017 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix deadlock between dedup on same file and starting writeback If we are deduping two ranges of the same file we need to make sure that we lock all pages in ascending order, that is, lock first the pages from the range with lower offset and then the pages from the other range, as otherwise we can deadlock with a concurrent task that is starting delalloc (writeback). Example trace: [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds. [74073.053889] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.056696] kworker/u32:10 D 0 17997 2 0x00000000 [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176) [74073.061370] ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240 [74073.064784] ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000 [74073.068386] ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240 [74073.071712] Call Trace: [74073.072884] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.075395] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.077511] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.079440] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.081637] [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14 [74073.083809] [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197 [74073.086314] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.100654] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.102619] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.104771] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.106969] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.108954] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.110981] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.112833] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.115010] [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs] [74073.116999] [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs] [74073.119243] [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs] [74073.121636] [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs] [74073.124229] [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs] [74073.126372] [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs] [74073.129371] [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs] [74073.131440] [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs] [74073.134303] [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.143911] [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae [74073.145787] [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7 [74073.147452] [<ffffffff811b60cd>] wb_workfn+0x194/0x37d [74073.149084] [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d [74073.150726] [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4 [74073.152694] [<ffffffff8106cf96>] process_one_work+0x273/0x4e4 [74073.154452] [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca [74073.156138] [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6 [74073.157837] [<ffffffff81072a81>] kthread+0xd5/0xdd [74073.159339] [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a [74073.161088] [<ffffffff814c6257>] ret_from_fork+0x27/0x40 [74073.162680] INFO: lockdep is turned off. [74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds. [74073.181180] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.185296] fdm-stress D 0 30264 29974 0x00000000 [74073.186810] ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00 [74073.188998] ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000 [74073.191070] ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00 [74073.193286] Call Trace: [74073.193990] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.195418] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.196796] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.198163] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.199621] [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf [74073.201100] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.202686] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.204051] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.205585] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.207123] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.208238] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.208871] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.209430] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.210101] [<ffffffff8112b800>] lock_page+0x2f/0x32 [74073.210636] [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153 [74073.211270] [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs] [74073.212166] [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs] [74073.213257] [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221 [74073.214086] [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600 [74073.214767] [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d [74073.215619] [<ffffffff811a7953>] ? __fget+0x6b/0x77 [74073.216338] [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79 [74073.217149] [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad [74073.218102] [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14 [74073.218968] [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa [74073.219938] INFO: lockdep is turned off. What happened was the following: CPU 1 CPU 2 btrfs_dedupe_file_range() --> using same inode as source and target --> src range is [768K, 1Mb[ --> dst range is [0, 256K[ btrfs_cmp_data_prepare() --> calls gather_extent_pages() for range [768K, 1Mb[ and locks all pages in that range do_writepages() btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() find_lock_delalloc_range() --> finds range [0, 1Mb[ lock_delalloc_pages() --> locks all pages in the range [0, 768K[ --> tries to lock page at offset 768K --> deadlock --> calls gather_extent_pages() to lock pages in the range [0, 256K[ --> deadlock, task at CPU 1 already locked that range and it's trying to lock the range we locked previously So fix this by making sure that during a dedup we always lock first the pages from the range with lower offset. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4a0ab9d7 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from clone_copy_inline_extent Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
1a287cfe |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameters from btrfs_cmp_data After the page locking has been reworked, we get all pages prepared via cmp_pages. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
61d7e4cb |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from create_snapshot The name parameters have never been used, as the name is passed via the dentry. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7775c818 |
|
10-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused parameter from btrfs_subvolume_release_metadata Unused since qgroup refactoring that split data and metadata accounting, the btrfs_qgroup_free helper. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
23269bf5 |
|
13-Feb-2017 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL in create_snapshot We don't need to use GFP_NOFS here as this is called from ioctls an the only lock held is the subvol_sem, which is of a high level and protects creation/renames/deletion and is never held in the writeout paths. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fc4badd9 |
|
18-Jan-2017 |
Omar Sandoval <osandov@fb.com> |
Btrfs: refactor btrfs_extent_same() slightly This was originally a prep patch for changing the behavior on len=0, but we went another direction with that. This still makes the function slightly easier to follow. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
43663557 |
|
17-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4a0cc7ca |
|
10-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_ino take a struct btrfs_inode Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8c3e6b1f |
|
21-Dec-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: btrfs_defrag_root() doesn't defrag extent root tree Since btrfs_defrag_leaves() does not support extent_root, remove its corresponding call. The user can use the file based defrag to defrag extents as of now. No change in behaviour as extent_root is explicitly skipped in btrfs_defrag_leaves and this has never worked as expected. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ ehnance changelong ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
50d0446e |
|
15-Dec-2016 |
Seraphime Kirkovski <kirkseraph@gmail.com> |
Btrfs: code cleanup min/max -> min_t/max_t This cleans up the cases where the min/max macros were used with a cast rather than using directly min_t/max_t. Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2a362249 |
|
06-Feb-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls Commit 4c63c2454ef incorrectly assumed that returning -ENOIOCTLCMD would cause the native ioctl to be called. The ->compat_ioctl callback is expected to handle all ioctls, not just compat variants. As a result, when using 32-bit userspace on 64-bit kernels, everything except those three ioctls would return -ENOTTY. Fixes: 4c63c2454ef ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl") Cc: stable@vger.kernel.org Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a76b5b04 |
|
09-Dec-2016 |
Christoph Hellwig <hch@lst.de> |
fs: try to clone files first in vfs_copy_file_range A clone is a perfectly fine implementation of a file copy, so most file systems just implement the copy that way. Instead of duplicating this logic move it to the VFS. Currently btrfs and XFS implement copies the same way as clones and there is no behavior change for them, cifs only implements clones and grow support for copy_file_range with this patch. NFS implements both, so this will allow copy_file_range to work on servers that only implement CLONE and be lot more efficient on servers that implements CLONE and COPY. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
3a45bb20 |
|
09-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: remove root parameter from transaction commit/end routines Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2ff7e61e |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0b246afa |
|
22-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
da17066c |
|
15-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
5112febb |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_init_new_device should use fs_info->dev_root btrfs_init_new_device only uses the root passed in via the ioctl to start the transaction. Nothing else that happens is related to whatever root the user used to initiate the ioctl. We can drop the root requirement and just use fs_info->dev_root instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6bccf3ab |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: call functions that always use the same root with fs_info instead There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
92872094 |
|
20-Nov-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
constify btrfs_mksubvol() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7b9ea627 |
|
10-Nov-2016 |
Shailendra Verma <shailendra.v@samsung.com> |
btrfs: return early from failed memory allocations in ioctl handlers There is no need to call kfree() if memdup_user() fails, as no memory was allocated and the error in the error-valued pointer should be returned. Signed-off-by: Shailendra Verma <shailendra.v@samsung.com> [ edit subject ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b159fa28 |
|
08-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove constant parameter to memset_extent_buffer and rename it The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
d24ee97b |
|
09-Nov-2016 |
David Sterba <dsterba@suse.com> |
btrfs: use new helpers to set uuids in eb Signed-off-by: David Sterba <dsterba@suse.com>
|
#
926b9233 |
|
05-Oct-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused headers, statfs.h Signed-off-by: David Sterba <dsterba@suse.com>
|
#
69ae5e44 |
|
12-Oct-2016 |
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> |
btrfs: make file clone aware of fatal signals Indeed this just make the behavior similar to xfs when process has fatal signals pending, and it'll make fstests/generic/298 happy. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c2050a45 |
|
14-Sep-2016 |
Deepa Dinamani <deepa.kernel@gmail.com> |
fs: Replace current_fs_time() with current_time() current_fs_time() uses struct super_block* as an argument. As per Linus's suggestion, this is changed to take struct inode* as a parameter instead. This is because the function is primarily meant for vfs inode timestamps. Also the function was renamed as per Arnd's suggestion. Change all calls to current_fs_time() to use the new current_time() function instead. current_fs_time() will be deleted. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5d163e0e |
|
20-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
325c50e3 |
|
21-Sep-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: ensure that file descriptor used with subvol ioctls is a dir If the subvol/snapshot create/destroy ioctls are passed a regular file with execute permissions set, we'll eventually Oops while trying to do inode->i_op->lookup via lookup_one_len. This patch ensures that the file descriptor refers to a directory. Fixes: cb8e70901d (Btrfs: Fix subvolume creation locking rules) Fixes: 76dda93c6a (Btrfs: add snapshot/subvolume destroy ioctl) Cc: <stable@vger.kernel.org> #v2.6.29+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d06f23d6 |
|
08-Aug-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: waiting on qgroup rescan should not always be interruptible We wait on qgroup rescan completion in three places: file system shutdown, the quota disable ioctl, and the rescan wait ioctl. If the user sends a signal while we're waiting, we continue happily along. This is expected behavior for the rescan wait ioctl. It's racy in the shutdown path but mostly works due to other unrelated synchronization points. In the quota disable path, it Oopses the kernel pretty much immediately. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
66642832 |
|
10-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_abort_transaction, drop root parameter __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
df397565 |
|
21-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: copy_to_sk drop unused root parameter The root parameter for copy_to_sk is not used at all. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3cdde224 |
|
09-Jun-2016 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
00235411 |
|
25-May-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
restore killability of old mutex_lock_killable(&inode->i_mutex) users The ones that are taking it exclusive, that is... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
01327610 |
|
19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
578def7c |
|
26-Apr-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: don't wait for unrelated IO to finish before relocation Before the relocation process of a block group starts, it sets the block group to readonly mode, then flushes all delalloc writes and then finally it waits for all ordered extents to complete. This last step includes waiting for ordered extents destinated at extents allocated in other block groups, making us waste unecessary time. So improve this by waiting only for ordered extents that fall into the block group's range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
|
#
58d7bbf8 |
|
04-May-2016 |
David Sterba <dsterba@suse.cz> |
btrfs: ioctl: reorder exclusive op check in RM_DEV Move the op exclusivity check before the other code (same as in ADD_DEV). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
7ab19625 |
|
04-May-2016 |
David Sterba <dsterba@suse.cz> |
btrfs: add write protection to SET_FEATURES ioctl Perform the want_write check if we get far enough to do any writes. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
49a3c4d9 |
|
24-Mar-2016 |
David Sterba <dsterba@suse.com> |
btrfs: use dynamic allocation for root item in create_subvol The size of root item is more than 400 bytes, which is quite a lot of stack space. As we do IO from inside the subvolume ioctls, we should keep the stack usage low in case the filesystem is on top of other layers (NFS, device mapper, iscsi, etc). Reviewed-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
15351955 |
|
11-Apr-2016 |
David Sterba <dsterba@suse.com> |
btrfs: clone: use vmalloc only as fallback for nodesize bufer Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2355ac84 |
|
28-Apr-2016 |
David Sterba <dsterba@suse.cz> |
btrfs: ioctl: reorder exclusive op check in RM_DEV Move the op exclusivity check before the other code (same as in ADD_DEV). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
9902af79 |
|
15-Apr-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
parallel lookups: actual switch to rwsem ta-da! The main issue is the lack of down_write_killable(), so the places like readdir.c switched to plain inode_lock(); once killable variants of rwsem primitives appear, that'll be dealt with. lockdep side also might need more work Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ad8403df |
|
09-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: pass the right error code to the btrfs_std_error Also drop the newline from the message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
018ed4f7 |
|
26-Apr-2016 |
David Sterba <dsterba@suse.com> |
btrfs: sink gfp parameter to set_extent_defrag Single caller passes GFP_NOFS. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b5255456 |
|
24-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: refactor btrfs_dev_replace_start for reuse A refactor patch, and avoids user input verification in the btrfs_dev_replace_start(), and so this function can be reused. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
735654ea |
|
15-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: rename flags for vol args v2 Rename BTRFS_DEVICE_BY_ID so it's more descriptive that we specify the device by id, it'll be part of the public API. The mask of supported flags is also renamed, only for internal use. The error code for unknown flags is EOPNOTSUPP, fixed. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6b526ed7 |
|
12-Feb-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: introduce device delete by devid This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as an user argument. The patch won't delete the old ioctl interface and so kernel remains backward compatible with user land progs. Test case/script: echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk mount /dev/sdd /btrfs dmsetup suspend bad_disk echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk dmsetup resume bad_disk echo "bad disk failed. now deleting/replacing" btrfs dev del 3 /btrfs echo $? btrfs fi show /btrfs umount /btrfs btrfs-show-super /dev/sdd | egrep num_device dmsetup remove bad_disk wipefs -a /dev/sdf Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Martin <m_btrfs@ml1.co.uk> [ adjust messages, s/disk/device/ ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4c63c245 |
|
29-Oct-2015 |
Luke Dashjr <luke@dashjr.org> |
btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl 32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr fail. Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
13f48dc9 |
|
14-Mar-2016 |
Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> |
btrfs: Simplify conditions about compress while mapping btrfs flags to inode flags Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
34d97007 |
|
16-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: rename btrfs_std_error to btrfs_handle_fs_error btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
#
09cbfeaf |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c79b4713 |
|
25-Mar-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: don't use src fd for printk The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ebb8765b |
|
10-Mar-2016 |
Anand Jain <anand.jain@oracle.com> |
btrfs: move btrfs_compression_type to compression.h So that its better organized. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f4dfe687 |
|
12-Feb-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix extent_same allowing destination offset beyond i_size When using the same file as the source and destination for a dedup (extent_same ioctl) operation we were allowing it to dedup to a destination offset beyond the file's size, which doesn't make sense and it's not allowed for the case where the source and destination files are not the same file. This made de deduplication operation successful only when the source range corresponded to a hole, a prealloc extent or an extent with all bytes having a value of 0x00. This was also leaving a file hole (between i_size and destination offset) without the corresponding file extent items, which can be reproduced with the following steps for example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar wrote 404990/404990 bytes at offset 304457 395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec) $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792 Deduping 2 total files (28672, 24576): /mnt/sdi/foobar (929792, 24576): /mnt/sdi/foobar 1 files asked to be deduped i: 0, status: 0, bytes_deduped: 24576 24576 total bytes deduped in this operation $ umount /mnt/sdi $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: 98c528aa-0833-427d-9403-b98032ffbf9d checking extents checking free space cache checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 712704, len: 217088 found 540673 bytes used err is 1 total csum bytes: 400 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123675 file data blocks allocated: 671744 referenced 671744 btrfs-progs v4.2.3 So fix this by not allowing the destination to go beyond the file's size, just as we do for the same where the source and destination files are not the same. A test for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2be63d5c |
|
12-Feb-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix file loss on log replay after renaming a file and fsync We have two cases where we end up deleting a file at log replay time when we should not. For this to happen the file must have been renamed and a directory inode must have been fsynced/logged. Two examples that exercise these two cases are listed below. Case 1) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/c $ touch /mnt/a/b/foo $ sync $ mv /mnt/a/b/foo /mnt/c/ # Create file bar just to make sure the fsync on directory a/ does # something and it's not a no-op. $ touch /mnt/a/bar $ xfs_io -c "fsync" /mnt/a < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file foo. Case 2) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/foo $ ln /mnt/a/foo /mnt/b/foo_link $ touch /mnt/b/bar $ sync $ unlink /mnt/b/foo_link $ mv /mnt/b/bar /mnt/c/ $ xfs_io -c "fsync" /mnt/a/foo < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file bar. The reason why the files are deleted is because when we log inodes other then the fsync target inode, we ignore their last_unlink_trans value and leave the log without enough information to later replay the rename operations. So we need to look at the last_unlink_trans values and fallback to a transaction commit if they are greater than the id of the last committed transaction. So fix this by looking at the last_unlink_trans values and fallback to transaction commits when needed. Also, when logging other inodes (for case 1 we logged descendants of the fsync target inode while for case 2 we logged ascendants) we need to care about concurrent tasks updating the last_unlink_trans of inodes we are logging (which was already an existing problem in check_parent_dirs_for_sync()). Since we can not acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes deadlocks with other concurrent operations that acquire the i_mutex of 2 inodes (other fsyncs or renames for example), we need to serialize on the log_mutex of the inode we are logging. A task setting a new value for an inode's last_unlink_trans must acquire the inode's log_mutex and it must do this update before doing the actual unlink operation (which is already the case except when deleting a snapshot). Conversely the task logging the inode must first log the inode and then check the inode's last_unlink_trans value while holding its log_mutex, as if its value is not greater then the id of the last committed transaction it means it logged a safe state of the inode's items, while if its value is not smaller then the id of the last committed transaction it means the inode state it has logged might not be safe (the concurrent task might have just updated last_unlink_trans but hasn't done yet the unlink operation) and therefore a transaction commit must be done. Test cases for xfstests follow in separate patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1ec9a1ae |
|
10-Feb-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix unreplayable log after snapshot delete + parent dir fsync If we delete a snapshot, fsync its parent directory and crash/power fail before the next transaction commit, on the next mount when we attempt to replay the log tree of the root containing the parent directory we will fail and prevent the filesystem from mounting, which is solvable by wiping out the log trees with the btrfs-zero-log tool but very inconvenient as we will lose any data and metadata fsynced before the parent directory was fsynced. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ btrfs subvolume snapshot /mnt /mnt/testdir/snap $ btrfs subvolume delete /mnt/testdir/snap $ xfs_io -c "fsync" /mnt/testdir < crash / power failure and reboot > $ mount /dev/sdc /mnt mount: mount(2) failed: No such file or directory And in dmesg/syslog we get the following message and trace: [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [192066.363010] ------------[ cut here ]------------ [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]() [192066.367250] BTRFS: Transaction aborted (error -2) [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1 [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8 [192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe [192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710 [192066.385827] Call Trace: [192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79 [192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2 [192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50 [192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266 [192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs] [192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs] [192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs] [192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs] [192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs] [192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs] [192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs] [192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs] [192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs] [192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs] [192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2 [192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c [192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1 [192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]--- [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree) [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30 [192066.444613] BTRFS: open_ctree failed This happens because when we are replaying the log and processing the directory entry pointing to the snapshot in the subvolume tree, we treat its btrfs_dir_item item as having a location with a key type matching BTRFS_INODE_ITEM_KEY, which is wrong because the type matches BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the object id refers to a root number and not to an inode in the root containing the parent directory. So fix this by triggering a transaction commit if an fsync against the parent directory is requested after deleting a snapshot. This is the simplest approach for a rare use case. Some alternative that avoids the transaction commit would require more code to explicitly delete the snapshot at log replay time (factoring out common code from ioctl.c: btrfs_ioctl_snap_destroy()), special care at fsync time to remove the log tree of the snapshot's root from the log root of the root of tree roots, amongst other steps. A test case for xfstests that triggers the issue follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create a snapshot at the root of our filesystem (mount point path), delete it, # fsync the mount point path, crash and mount to replay the log. This should # succeed and after the filesystem is mounted the snapshot should not be visible # anymore. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT _flakey_drop_and_remount [ -e $SCRATCH_MNT/snap1 ] && \ echo "Snapshot snap1 still exists after log replay" # Similar scenario as above, but this time the snapshot is created inside a # directory and not directly under the root (mount point path). mkdir $SCRATCH_MNT/testdir _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir _flakey_drop_and_remount [ -e $SCRATCH_MNT/testdir/snap2 ] && \ echo "Snapshot snap2 still exists after log replay" _unmount_flakey echo "Silence is golden" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d5131b65 |
|
17-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused argument in btrfs_ioctl_get_supported_features Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
c5868f83 |
|
17-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls The control device is accessible when no filesystem is mounted and we may want to query features supported by the module. This is already possible using the sysfs files, this ioctl is for parity and convenience. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
11ea474f |
|
11-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: remove error message from search ioctl for nonexistent tree Let's remove the error message that appears when the tree_id is not present. This can happen with the quota tree and has been observed in practice. The applications are supposed to handle -ENOENT and we don't need to report that in the system log as it's not a fatal error. Reported-by: Vlastimil Babka <vbabka@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
04b285f3 |
|
07-Feb-2016 |
Deepa Dinamani <deepa.kernel@gmail.com> |
btrfs: Replace CURRENT_TIME by current_fs_time() CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ac1407ba |
|
27-Jan-2016 |
Byongho Lee <bhlee.kernel@gmail.com> |
btrfs: remove redundant error check While running btrfs_mksubvol(), d_really_is_positive() is called twice. First in btrfs_mksubvol() and second inside btrfs_may_create(). So I remove the first one. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
66722f7c |
|
11-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: switch to kcalloc in btrfs_cmp_data_prepare Kcalloc is functionally equivalent and does overflow checks. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
fd95ef56 |
|
11-Feb-2016 |
David Sterba <dsterba@suse.com> |
btrfs: extent same: use GFP_KERNEL for page array allocations We can safely use GFP_KERNEL in the functions called from the ioctl handlers. Here we can allocate up to 32k so less pressure to the allocator could help. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
31314002 |
|
27-Jan-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix page reading in extent_same ioctl leading to csum errors In the extent_same ioctl, we were grabbing the pages (locked) and attempting to read them without bothering about any concurrent IO against them. That is, we were not checking for any ongoing ordered extents nor waiting for them to complete, which leads to a race where the extent_same() code gets a checksum verification error when it reads the pages, producing a message like the following in dmesg and making the operation fail to user space with -ENOMEM: [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868 Fix this by using btrfs_readpage() for reading the pages instead of extent_read_full_page_nolock(), which waits for any concurrent ordered extents to complete and locks the io range. Also do better error handling and don't treat all failures as -ENOMEM, as that's clearly misleasing, becoming identical to the checks and operation of prepare_uptodate_page(). The use of extent_read_full_page_nolock() was required before commit f441460202cb ("btrfs: fix deadlock with extent-same and readpage"), as we had the range locked in an inode's io tree before attempting to read the pages. Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage") Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
e0bd70c6 |
|
27-Jan-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl In the extent_same ioctl we are getting the pages for the source and target ranges and unlocking them immediately after, which is incorrect because later we attempt to map them (with kmap_atomic) and access their contents at btrfs_cmp_data(). When we do such access the pages might have been relocated or removed from memory, which leads to an invalid memory access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y which produces a trace like the following: 186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64 parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1 [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000 [186736.681319] RIP: 0010:[<ffffffff81264d00>] [<ffffffff81264d00>] memcmp+0xb/0x22 [186736.681319] RSP: 0018:ffff880362287d70 EFLAGS: 00010287 [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000 [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000 [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000 [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000 [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40 [186736.681319] FS: 00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000 [186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0 [186736.681319] Stack: [186736.681319] ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000 [186736.681319] 0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400 [186736.681319] 00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038 [186736.681319] Call Trace: [186736.681319] [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs] [186736.681319] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [186736.681319] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [186736.681319] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [186736.681319] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [186736.681319] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [186736.681319] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0 (gdb) list *(btrfs_ioctl+0x24cb) 0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972). 2967 dst_addr = kmap_atomic(dst_page); 2968 2969 flush_dcache_page(src_page); 2970 flush_dcache_page(dst_page); 2971 2972 if (memcmp(addr, dst_addr, cmp_len)) 2973 ret = BTRFS_SAME_DATA_DIFFERS; 2974 2975 kunmap_atomic(addr); 2976 kunmap_atomic(dst_addr); So fix this by making sure we keep the pages locked and respect the same locking order as everywhere else: get and lock the pages first and then lock the range in the inode's io tree (like for example at __btrfs_buffered_write() and extent_readpages()). If an ordered extent is found after locking the range in the io tree, unlock the range, unlock the pages, wait for the ordered extent to complete and repeat the entire locking process until no overlapping ordered extents are found. Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
65bfa658 |
|
21-Jan-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone operation In subpagesize-blocksize scenario, the "destination offset" argument passed to the btrfs_ioctl_clone() can be aligned to sectorsize but may not be necessarily aligned to the machine's page size. In such cases, truncate_inode_pages_range() ends up zeroing out the partial page and future read operations will return incorrect data. Hence this commit explicitly rounds down the "destination offset" to the machine's page size. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e410e34f |
|
29-Jan-2016 |
Chris Mason <clm@fb.com> |
Revert "btrfs: synchronize incompat feature bits with sysfs files" This reverts commit 14e46e04958df740c6c6a94849f176159a333f13. This ends up doing sysfs operations from deep in balance (where we should be GFP_NOFS) and under heavy balance load, we're making races against sysfs internals. Revert it for now while we figure things out. Signed-off-by: Chris Mason <clm@fb.com>
|
#
5955102c |
|
22-Jan-2016 |
Al Viro <viro@zeniv.linux.org.uk> |
wrappers for ->i_mutex access parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
14e46e04 |
|
21-Jan-2016 |
David Sterba <dsterba@suse.com> |
btrfs: synchronize incompat feature bits with sysfs files The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount (eg. the LZO compression). Synchronize the feature bits with the sysfs files representing them right after we set/clear them. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
f32e48e9 |
|
07-Jan-2016 |
Chandan Rajendra <chandan@linux.vnet.ibm.com> |
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots The following call trace is seen when btrfs/031 test is executed in a loop, [ 158.661848] ------------[ cut here ]------------ [ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea() [ 158.664102] BTRFS: Transaction aborted (error -2) [ 158.664774] Modules linked in: [ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2 [ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930 [ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000 [ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980 [ 158.670769] Call Trace: [ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c [ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0 [ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50 [ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea [ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520 [ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50 [ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180 [ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70 [ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90 [ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300 [ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170 [ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70 [ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0 [ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490 [ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60 [ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80 [ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 158.685958] ---[ end trace 4b63312de5a2cb76 ]--- [ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry [ 158.709508] BTRFS info (device loop0): forced readonly [ 158.737113] BTRFS info (device loop0): disk space caching is enabled [ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed [ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30 This occurs because, Mount filesystem Create subvol with ID 257 Unmount filesystem Mount filesystem Delete subvol with ID 257 btrfs_drop_snapshot() Add root corresponding to subvol 257 into btrfs_transaction->dropped_roots list Create new subvol (i.e. create_subvol()) 257 is returned as the next free objectid btrfs_read_fs_root_no_name() Finds the btrfs_root instance corresponding to the old subvol with ID 257 in btrfs_fs_info->fs_roots_radix. Returns error since btrfs_root_item->refs has the value of 0. To fix the issue the commit initializes tree root's and subvolume root's highest_objectid when loading the roots from disk. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8546b570 |
|
10-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: preallocate path for snapshot creation at ioctl time We can also preallocate btrfs_path that's used during pending snapshot creation and avoid another late ENOMEM failure. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
b0c0ea63 |
|
10-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: allocate root item at snapshot ioctl time The actual snapshot creation is delayed until transaction commit. If we cannot get enough memory for the root item there, we have to fail the whole transaction commit which is bad. So we'll allocate the memory at the ioctl call and pass it along with the pending_snapshot struct. The potential ENOMEM will be returned to the caller of snapshot ioctl. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a1ee7362 |
|
10-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: do an allocation earlier during snapshot creation We can allocate pending_snapshot earlier and do not have to do cleanup in case of failure. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
e4058b54 |
|
27-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: cleanup, use enum values for btrfs_path reada Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4d4ab6d6 |
|
19-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: constify static arrays There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ee22184b |
|
14-Dec-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
2b3909f8 |
|
19-Dec-2015 |
Darrick J. Wong <darrick.wong@oracle.com> |
btrfs: use new dedupe data function pointer Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
04b38d60 |
|
02-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
vfs: pull btrfs clone API to vfs layer The btrfs clone ioctls are now adopted by other file systems, with NFS and CIFS already having support for them, and XFS being under active development. To avoid growth of various slightly incompatible implementations, add one to the VFS. Note that clones are different from file copies in several ways: - they are atomic vs other writers - they support whole file clones - they support 64-bit legth clones - they do not allow partial success (aka short writes) - clones are expected to be a fast metadata operation Because of that it would be rather cumbersome to try to piggyback them on top of the recent clone_file_range infrastructure. The converse isn't true and the clone_file_range system call could try clone file range as a first attempt to copy, something that further patches will enable. Based on earlier work from Peng Tao. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8d2db785 |
|
04-Nov-2015 |
David Sterba <dsterba@suse.com> |
btrfs: use GFP_KERNEL for allocations in ioctl handlers We don't have to use GFP_NOFS in the ioctl handlers because there's no risk of looping through the allocators back to the filesystem. This patch covers only allocations that are directly in the ioctl handlers. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ff13db41 |
|
03-Dec-2015 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused parameter from lock_extent_bits We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
3db11b2e |
|
10-Nov-2015 |
Zach Brown <zab@redhat.com> |
btrfs: add .copy_file_range file operation This rearranges the existing COPY_RANGE ioctl implementation so that the .copy_file_range file operation can call the core loop that copies file data extent items. The extent copying loop is lifted up into its own function. It retains the core btrfs error checks that should be shared. Signed-off-by: Zach Brown <zab@redhat.com> [Anna Schumaker: Make flags an unsigned int, Check for COPY_FR_REFLINK] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
849ef928 |
|
12-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: check unsupported filters in balance arguments We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b06c4bf5 |
|
23-Oct-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix regression running delayed references when using qgroups In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
|
#
7cf5b976 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: qgroup: Cleanup old inaccurate facilities Cleanup the old facilities which use old btrfs_qgroup_reserve() function call, replace them with the newer version, and remove the "__" prefix in them. Also, make btrfs_qgroup_reserve/free() functions private, as they are now only used inside qgroup codes. Now, the whole btrfs qgroup is swithed to use the new reserve facilities. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
df480633 |
|
08-Sep-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: extent-tree: Switch to new delalloc space reserve and release Use new __btrfs_delalloc_reserve_space() and __btrfs_delalloc_release_space() to reserve and release space for delalloc. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0f89abf5 |
|
20-Oct-2015 |
Christian Engelmayer <cengelma@gmx.at> |
btrfs: fix possible leak in btrfs_ioctl_balance() Commit 8eb934591f8b ("btrfs: check unsupported filters in balance arguments") adds a jump to exit label out_bargs in case the argument check fails. At this point in addition to the bargs memory, the memory for struct btrfs_balance_control has already been allocated. Ownership of bctl is passed to btrfs_balance() in the good case, thus the memory is not freed due to the introduced jump. Make sure that the memory gets freed in any case as necessary. Detected by Coverity CID 1328378. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d7641a49 |
|
01-Sep-2015 |
Byongho Lee <bhlee.kernel@gmail.com> |
btrfs: replace unnecessary list_for_each_entry_safe to list_for_each_entry There is no removing list element while iterating over list. So, replace list_for_each_entry_safe to list_for_each_entry. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
8039d87d |
|
13-Oct-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix file corruption and data loss after cloning inline extents Currently the clone ioctl allows to clone an inline extent from one file to another that already has other (non-inlined) extents. This is a problem because btrfs is not designed to deal with files having inline and regular extents, if a file has an inline extent then it must be the only extent in the file and must start at file offset 0. Having a file with an inline extent followed by regular extents results in EIO errors when doing reads or writes against the first 4K of the file. Also, the clone ioctl allows one to lose data if the source file consists of a single inline extent, with a size of N bytes, and the destination file consists of a single inline extent with a size of M bytes, where we have M > N. In this case the clone operation removes the inline extent from the destination file and then copies the inline extent from the source file into the destination file - we lose the M - N bytes from the destination file, a read operation will get the value 0x00 for any bytes in the the range [N, M] (the destination inode's i_size remained as M, that's why we can read past N bytes). So fix this by not allowing such destructive operations to happen and return errno EOPNOTSUPP to user space. Currently the fstest btrfs/035 tests the data loss case but it totally ignores this - i.e. expects the operation to succeed and does not check the we got data loss. The following test case for fstests exercises all these cases that result in file corruption and data loss: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" rm -f $seqres.full test_cloning_inline_extents() { local mkfs_opts=$1 local mount_opts=$2 _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 _scratch_mount $mount_opts # File bar, the source for all the following clone operations, consists # of a single inline extent (50 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ | _filter_xfs_io # Test cloning into a file with an extent (non-inlined) where the # destination offset overlaps that extent. It should not be possible to # clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo data after clone operation:" # All bytes should have the value 0xaa (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io # Test cloning the inline extent against a file which has a hole in its # first 4K followed by a non-inlined extent. It should not be possible # as well to clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo2 data after clone operation:" # All bytes should have the value 0x00 (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo2 $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io # Test cloning the inline extent against a file which has a size of zero # but has a prealloc extent. It should not be possible as well to clone # the inline extent from file bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "First 50 bytes of foo3 after clone operation:" # Should not be able to read any bytes, file has 0 bytes i_size (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo3 $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size not greater than the size of # bar's inline extent (40 < 50). # It should be possible to do the extent cloning from bar to this file. $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4 # Doing IO against any range in the first 4K of the file should work. echo "File foo4 data after clone operation:" # Must match file bar's content. od -t x1 $SCRATCH_MNT/foo4 $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size greater than the size of bar's # inline extent (60 > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5 # Reading the file should not fail. echo "File foo5 data after clone operation:" # Must have a size of 60 bytes, with all bytes having a value of 0x03 # (the clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo5 # Test cloning the inline extent against a file which has no extents but # has a size greater than bar's inline extent (16K > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6 # Reading the file should not fail. echo "File foo6 data after clone operation:" # Must have a size of 16K, with all bytes having a value of 0x00 (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo6 # Test cloning the inline extent against a file which has no extents but # has a size not greater than bar's inline extent (30 < 50). # It should be possible to clone the inline extent from file bar into # this file. $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7 # Reading the file should not fail. echo "File foo7 data after clone operation:" # Must have a size of 50 bytes, with all bytes having a value of 0xbb. od -t x1 $SCRATCH_MNT/foo7 # Test cloning the inline extent against a file which has a size not # greater than the size of bar's inline extent (20 < 50) but has # a prealloc extent that goes beyond the file's size. It should not be # possible to clone the inline extent from bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" \ -c "pwrite -S 0x88 0 20" \ $SCRATCH_MNT/foo8 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8 echo "File foo8 data after clone operation:" # Must have a size of 20 bytes, with all bytes having a value of 0x88 # (the clone operation did not modify our file). od -t x1 $SCRATCH_MNT/foo8 _scratch_unmount } echo -e "\nTesting without compression and without the no-holes feature...\n" test_cloning_inline_extents echo -e "\nTesting with compression and without the no-holes feature...\n" test_cloning_inline_extents "" "-o compress" echo -e "\nTesting without compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "" echo -e "\nTesting with compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "-o compress" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
8eb93459 |
|
12-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: check unsupported filters in balance arguments We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f14d104d |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: switch more printks to our helpers Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>
|
#
ecaeb14b |
|
08-Oct-2015 |
David Sterba <dsterba@suse.com> |
btrfs: switch message printers to _in_rcu variants Signed-off-by: David Sterba <dsterba@suse.com>
|
#
a4553fef |
|
25-Sep-2015 |
Anand Jain <anand.jain@oracle.com> |
Btrfs: consolidate btrfs_error() to btrfs_std_error() btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
293a8489 |
|
30-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: fix clone / extent-same deadlocks Clone and extent same lock their source and target inodes in opposite order. In addition to this, the range locking in clone doesn't take ordering into account. Fix this by having clone use the same locking helpers as btrfs-extent-same. In addition, I do a small cleanup of the locking helpers, removing a case (both inodes being the same) which was poorly accounted for and never actually used by the callers. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4a3560c4 |
|
07-Aug-2015 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix defrag to merge tail file extent The file layout is [extent 1]...[extent n][4k extent][HOLE][extent x] extent 1~n and 4k extent can be merged during defrag, and the whole defrag bytes is larger than our defrag thresh(256k), 4k extent as a tail is left unmerged since we check if its next extent can be merged (the next one is a hole, so the check will fail), the layout thus can be [new extent][4k extent][HOLE][extent x] (1~n) To fix it, beside looking at the next one, this also looks at the previous one by checking @defrag_end, which is set to 0 when we decide to stop merging contiguous extents, otherwise, we can merge the previous one with our extent. Also, this makes btrfs behave consistent with how xfs and ext4 do. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
dd81d459 |
|
29-Jun-2015 |
Naohiro Aota <naota@elisp.net> |
btrfs: fix search key advancing condition The search key advancing condition used in copy_to_sk() is loose. It can advance the key even if it reaches sk->max_*: e.g. when the max key = (512, 1024, -1) and the current key = (512, 1025, 10), it increments the offset by 1, continues hopeless search from (512, 1025, 11). This issue make ioctl() to take unexpectedly long time scanning all the leaf a blocks one by one. This commit fix the problem using standard way of key comparison: btrfs_comp_cpu_keys() Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ed958762 |
|
14-Jul-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix file corruption after cloning inline extents Using the clone ioctl (or extent_same ioctl, which calls the same extent cloning function as well) we end up allowing copy an inline extent from the source file into a non-zero offset of the destination file. This is something not expected and that the btrfs code is not prepared to deal with - all inline extents must be at a file offset equals to 0. For example, the following excerpt of a test case for fstests triggers a crash/BUG_ON() on a write operation after an inline extent is cloned into a non-zero offset: _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test files. File foo has the same 2K of data at offset 4K # as file bar has at its offset 0. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ -c "pwrite -S 0xbb 4k 2K" \ -c "pwrite -S 0xcc 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # File bar consists of a single inline extent (2K size). $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ $SCRATCH_MNT/bar | _filter_xfs_io # Now call the clone ioctl to clone the extent of file bar into file # foo at its offset 4K. This made file foo have an inline extent at # offset 4K, something which the btrfs code can not deal with in future # IO operations because all inline extents are supposed to start at an # offset of 0, resulting in all sorts of chaos. # So here we validate that clone ioctl returns an EOPNOTSUPP, which is # what it returns for other cases dealing with inlined extents. $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Because of the inline extent at offset 4K, the following write made # the kernel crash with a BUG_ON(). $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io status=0 exit The stack trace of the BUG_ON() triggered by the last write is: [152154.035903] ------------[ cut here ]------------ [152154.036424] kernel BUG at mm/page-writeback.c:2286! [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 [152154.036424] Stack: [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 [152154.036424] Call Trace: [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs] [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5 [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4 [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82 [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP <ffff880429effc68> [152154.242621] ---[ end trace e3d3376b23a57041 ]--- Fix this by returning the error EOPNOTSUPP if an attempt to copy an inline extent into a non-zero offset happens, just like what is done for other scenarios that would require copying/splitting inline extents, which were introduced by the following commits: 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
|
#
497b4050 |
|
03-Jul-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix memory leak in the extent_same ioctl We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter, credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
|
#
1c919a5e |
|
30-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: don't update mtime/ctime on deduped inodes One issue users have reported is that dedupe changes mtime on files, resulting in tools like rsync thinking that their contents have changed when in fact the data is exactly the same. We also skip the ctime update as no user-visible metadata changes here and we want dedupe to be transparent to the user. Clone still wants time changes, so we special case this in the code. This was tested with the btrfs-extent-same tool. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
|
#
0efa9f48 |
|
30-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: allow dedupe of same inode clone() supports cloning within an inode so extent-same can do the same now. This patch fixes up the locking in extent-same to know about the single-inode case. In addition to that, we add a check for overlapping ranges, which clone does not allow. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f4414602 |
|
30-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: fix deadlock with extent-same and readpage ->readpage() does page_lock() before extent_lock(), we do the opposite in extent-same. We want to reverse the order in btrfs_extent_same() but it's not quite straightforward since the page locks are taken inside btrfs_cmp_data(). So I split btrfs_cmp_data() into 3 parts with a small context structure that is passed between them. The first, btrfs_cmp_data_prepare() gathers up the pages needed (taking page lock as required) and puts them on our context structure. At this point, we are safe to lock the extent range. Afterwards, we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free() to clean up our context. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
207910dd |
|
30-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: pass unaligned length to btrfs_cmp_data() In the case that we dedupe the tail of a file, we might expand the dedupe len out to the end of our last block. We don't want to compare data past i_size however, so pass the original length to btrfs_cmp_data(). Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e1d227a4 |
|
08-Jun-2015 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: Handle unaligned length in extent_same The extent-same code rejects requests with an unaligned length. This poses a problem when we want to dedupe the tail extent of files as we skip cloning the portion between i_size and the extent boundary. If we don't clone the entire extent, it won't be deleted. So the combination of these behaviors winds up giving us worst-case dedupe on many files. We can fix this by allowing a length that extents to i_size and internally aligining those to the end of the block. This is what btrfs_ioctl_clone() so we can just copy that check over. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
|
#
070034bd |
|
08-Jun-2015 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag. max_to_defrag represents the number of pages to defrag rather than the last page of the file range to be defragged. Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag should actually have the value 4. Instead in the current code we end up setting it to 6. Now, this does not (yet) cause an issue since the first part of the while loop condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control to flow out of the while loop before any buggy behavior is actually caused. So the patch just makes sure that max_to_defrag ends up having the right value rather than fixing a bug. I did run the xfstests suite to make sure that the code does not regress. Changelog: v1->v2: Provide a much descriptive commit message. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e4826a5b |
|
09-Jun-2015 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: btrfs_defrag_file: Fix ra_index computation. Read-ahead is done for the pages in the range [ra_index, ra_index + cluster - 1]. So the next read-ahead should be starting from the page at index 'ra_index + cluster' (unless we deemed that the extent at 'ra_index + cluster' as non-defraggable) rather than from the page at index 'ra_index + max_cluster'. This patch fixes this. I did run the xfstests suite to make sure that the code does not regress. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
01b810b8 |
|
12-May-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: make root id query unprivileged The INO_LOOKUP ioctl can lookup path for a given inode number and is thus restricted. As a sideefect it can find the root id of the containing subvolume and we're using this int the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args args::treeid = 0 args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID) Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged, after the root restriction is removed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6d13f549 |
|
24-Apr-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: fix warnings after changes in btrfs_abort_transaction fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
64ad6c48 |
|
02-Jun-2015 |
Omar Sandoval <osandov@osandov.com> |
Btrfs: don't invalidate root dentry when subvolume deletion fails Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
909e26dc |
|
10-Apr-2015 |
Omar Sandoval <osandov@osandov.com> |
btrfs: unlock i_mutex after attempting to delete subvolume during send Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled: [ +0.000059] ================================================ [ +0.000028] [ BUG: lock held when returning to user space! ] [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted [ +0.000026] ------------------------------------------------ [ +0.000029] btrfs/211 is leaving the kernel with locks still held! [ +0.000029] 1 lock held by btrfs/211: [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0 Make sure we unlock it in the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2b0143b5 |
|
17-Mar-2015 |
David Howells <dhowells@redhat.com> |
VFS: normal filesystems (and lustre): d_inode() annotations that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e082f563 |
|
27-Feb-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: quota: Update quota tree after qgroup relationship change. Previous patch modified the in memory struct but it's not written in quota tree until next commit. So user will still get old data using "btrfs qgroup show" after assign/remove. This patch will call btrfs_run_qgroups in assign ioctl so it will be updated to in memory quota trees and user will get up-to-date results. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e09fe2d2 |
|
27-Feb-2015 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created Btrfs will create qgroup on subvolume creation if quota is enabled, but qgroup uses the high bits(currently 16 bits) as level, to build the inheritance. However it is fully possible a subvolume can be created with a subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be considered as level 1 and can't be assigned to other qgroup in level 1. This patch will prevent such things so qgroup inheritance will not be screwed up. The downside is very clear, btrfs subvolume number limit will decrease from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to (u48 max -256(first free objectid)). But we still have near u48(that's 15 digits in dec), so that should not be a huge problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4087cf24 |
|
18-Jan-2015 |
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> |
Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup(). Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
de249e66 |
|
11-Apr-2015 |
Chris Mason <clm@fb.com> |
Btrfs: fix uninit variable in clone ioctl Commit 0d97a64e0 creates a new variable but doesn't always set it up. This puts it back to the original method (key.offset + 1) for the cases not covered by Filipe's new logic. Signed-off-by: Chris Mason <clm@fb.com>
|
#
ccccf3d6 |
|
30-Mar-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix inode eviction infinite loop after cloning into it If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
113e8283 |
|
30-Mar-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix inode eviction infinite loop after extent_same ioctl If we pass a length of 0 to the extent_same ioctl, we end up locking an extent range with a start offset greater then its end offset (if the destination file's offset is greater than zero). This results in a warning from extent_io.c:insert_state through the following call chain: btrfs_extent_same() btrfs_double_lock() lock_extent_range() lock_extent(inode->io_tree, offset, offset + len - 1) lock_extent_bits() __set_extent_bit() insert_state() --> WARN_ON(end < start) This leads to an infinite loop when evicting the inode. This is the same problem that my previous patch titled "Btrfs: fix inode eviction infinite loop after cloning into it" addressed but for the extent_same ioctl instead of the clone ioctl. CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
df858e76 |
|
31-Mar-2015 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix range cloning when same inode used as source and destination While searching for extents to clone we might find one where we only use a part of it coming from its tail. If our destination inode is the same the source inode, we end up removing the tail part of the extent item and insert after a new one that point to the same extent with an adjusted key file offset and data offset. After this we search for the next extent item in the fs/subvol tree with a key that has an offset incremented by one. But this second search leaves us at the new extent item we inserted previously, and since that extent item has a non-zero data offset, it it can make us call btrfs_drop_extents with an empty range (start == end) which causes the following warning: [23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]() (...) [23978.557266] Call Trace: [23978.557978] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.559191] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.560699] [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.562389] [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c [23978.563613] [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.565103] [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28 [23978.566294] [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf [23978.567438] [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs] [23978.568702] [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf [23978.569763] [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb [23978.570817] [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36 [23978.571872] [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb [23978.573466] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [23978.574962] [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs] [23978.576179] [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs] [23978.577311] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.578520] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.580282] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.591887] ---[ end trace 988ec2a653d03ed3 ]--- Then we attempt to insert a new extent item with a key that already exists, which makes btrfs_insert_empty_item return -EEXIST resulting in abortion of the current transaction: [23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() (...) [23978.622589] Call Trace: [23978.623181] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.624359] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.625573] [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.626971] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [23978.628003] [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f [23978.629138] [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.630528] [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs] [23978.631635] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.632886] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.634119] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.647714] ---[ end trace 988ec2a653d03ed4 ]--- This is wrong because we should not process the extent item that we just inserted previously, and instead process the extent item that follows it in the tree For example for the test case I wrote for fstests: bs=$((64 * 1024)) mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc mount /dev/sdc /mnt xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo The second clone call fails with -EEXIST, because when we process the first extent item (offset 262144), we drop part of it (counting from the end) and then insert a new extent item with a key greater then the key we found. The next time we search the tree we search for a key with offset 262144 + 1, which leaves us at the new extent item we have just inserted but we think it refers to an extent that we need to clone. Fix this by ensuring the next search key uses an offset corresponding to the offset of the key we found previously plus the data length of the corresponding extent item. This ensures we skip new extent items that we inserted and works for the case of implicit holes too (NO_HOLES feature). A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
b8b93add |
|
16-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup 64bit/32bit divs, provably bounded values The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
3284da7b |
|
25-Feb-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: use explicit initializer for seq_elem Using {} as initializer for struct seq_elem does not properly initialize the list_head member, but it currently works because it gets set through btrfs_get_tree_mod_seq if 'seq' is 0. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
c7abe829 |
|
16-Jan-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup 64bit/32bit divs, provably bounded values The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
e36cb0b8 |
|
28-Jan-2015 |
David Howells <dhowells@redhat.com> |
VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) Convert the following where appropriate: (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry). (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry). (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more complicated than it appears as some calls should be converted to d_can_lookup() instead. The difference is whether the directory in question is a real dir with a ->lookup op or whether it's a fake dir with a ->d_automount op. In some circumstances, we can subsume checks for dentry->d_inode not being NULL into this, provided we the code isn't in a filesystem that expects d_inode to be NULL if the dirent really *is* negative (ie. if we're going to use d_inode() rather than d_backing_inode() to get the inode pointer). Note that the dentry type field may be set to something other than DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS manages the fall-through from a negative dentry to a lower layer. In such a case, the dentry type of the negative union dentry is set to the same as the type of the lower dentry. However, if you know d_inode is not NULL at the call site, then you can use the d_is_xxx() functions even in a filesystem. There is one further complication: a 0,0 chardev dentry may be labelled DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was intended for special directory entry types that don't have attached inodes. The following perl+coccinelle script was used: use strict; my @callers; open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') || die "Can't grep for S_ISDIR and co. callers"; @callers = <$fd>; close($fd); unless (@callers) { print "No matches\n"; exit(0); } my @cocci = ( '@@', 'expression E;', '@@', '', '- S_ISLNK(E->d_inode->i_mode)', '+ d_is_symlink(E)', '', '@@', 'expression E;', '@@', '', '- S_ISDIR(E->d_inode->i_mode)', '+ d_is_dir(E)', '', '@@', 'expression E;', '@@', '', '- S_ISREG(E->d_inode->i_mode)', '+ d_is_reg(E)' ); my $coccifile = "tmp.sp.cocci"; open($fd, ">$coccifile") || die $coccifile; print($fd "$_\n") || die $coccifile foreach (@cocci); close($fd); foreach my $file (@callers) { chomp $file; print "Processing ", $file, "\n"; system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 || die "spatch failed"; } [AV: overlayfs parts skipped] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9ea24bbe |
|
29-Oct-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix snapshot inconsistency after a file write followed by truncate If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it. For example, if we perform the following file operations: $ mkfs.btrfs -f /dev/vdd $ mount /dev/vdd /mnt $ xfs_io -f \ -c "pwrite -S 0xaa -b 32K 0 32K" \ -c "fsync" \ -c "pwrite -S 0xbb -b 32770 16K 32770" \ -c "truncate 90123" \ /mnt/foobar and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree: item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160 inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0 item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20 inode ref index 282 namelen 10 name: foobar item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53 extent data disk byte 1104855040 nr 32768 extent data offset 0 nr 32768 ram 32768 extent compression 0 item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 40960 ram 40960 extent compression 0 There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[ for which there's no file extent item covering it. This is because the file write and file truncate operations happened both right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. So this made it possibe to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl. Btrfs' fsck tool complains about such cases with a message like the following: "root 331 inode 257 errors 100, file extent discount" >From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file that either: 1) is empty 2) only the first write was captured 3) only the 2 writes were captured 4) both writes and the truncation were captured But never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation). A test case for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e5fa8f86 |
|
21-Oct-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: ensure send always works on roots without orphans Move the logic from the snapshot creation ioctl into send. This avoids doing the transaction commit if send isn't used, and ensures that if a crash/reboot happens after the transaction commit that created the snapshot and before the transaction commit that switched the commit root, send will not get a commit root that differs from the main root (that has orphan items). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ddb52f4f |
|
21-Oct-2014 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: get rid of f_dentry use Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cbdf35bc |
|
23-Oct-2014 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: export check_sticky() It's already duplicated in btrfs and about to be used in overlayfs too. Move the sticky bit check to an inline helper and call the out-of-line helper only in the unlikly case of the sticky bit being set. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
#
d3797308 |
|
15-Oct-2014 |
Chris Mason <clm@fb.com> |
Revert "Btrfs: race free update of commit root for ro snapshots" This reverts commit 9c3b306e1c9e6be4be09e99a8fe2227d1005effc. Switching only one commit root during a transaction is wrong because it leads the fs into an inconsistent state. All commit roots should be switched at once, at transaction commit time, otherwise backref walking can often miss important references that were only accessible through the old commit root. Plus, the root item for the snapshot's root wasn't getting updated and preventing the next transaction commit to do it. This made several users get into random corruption issues after creation of readonly snapshots. A regression test for xfstests will follow soon. Cc: stable@vger.kernel.org # 3.17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5542aa2f |
|
13-Feb-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
vfs: Make d_invalidate return void Now that d_invalidate can no longer fail, stop returning a useless return code. For the few callers that checked the return code update remove the handling of d_invalidate failure. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4d75f8a9 |
|
14-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove blocksize from btrfs_alloc_free_block and rename Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free + _tree_block family. The parameter blocksize was set to the metadata block size, directly or indirectly. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
ee39b432 |
|
29-Sep-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove unlikely from data-dependent branches and slow paths There are the branch hints that obviously depend on the data being processed, the CPU predictor will do better job according to the actual load. It also does not make sense to use the hints in slow paths that do a lot of other operations like locking, waiting or IO. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
aab110ab |
|
29-Jul-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: defrag, use unsigned type for extent thresh Signed type mismatches the ioctl structure, all extent calculations are done on unsigned types. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
7cc8e58d |
|
03-Sep-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unprotected device's variants on 32bits machine ->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk lock when we change them, but sometimes we read them without any lock, and we might get unexpected value. We fix this problem like inode's i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
78a017a2 |
|
11-Sep-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add missing compression property remove in btrfs_ioctl_setflags The behaviour of a 'chattr -c' consists of getting the current flags, clearing the FS_COMPR_FL bit and then sending the result to the set flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags passed to the ioctl. This results in the compression property not being cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL was set in the received flags. Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt && cd /mnt $ mkdir a $ chattr +c a $ touch a/file $ lsattr a/file --------c------- a/file $ chattr -c a $ touch a/file2 $ lsattr a/file2 --------c------- a/file2 $ lsattr -d a ---------------- a Reported-by: Andreas Schneider <asn@cryptomilk.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f98de9b9 |
|
04-Aug-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: make btrfs_search_forward return with nodes unlocked None of the uses of btrfs_search_forward() need to have the path nodes (level >= 1) read locked, only the leaf needs to be locked while the caller processes it. Therefore make it return a path with all nodes unlocked, except for the leaf. This change is motivated by the observation that during a file fsync we repeatdly call btrfs_search_forward() and process the returned leaf while upper nodes of the returned path (level >= 1) are read locked, which unnecessarily blocks other tasks that want to write to the same fs/subvol btree. Therefore instead of modifying the fsync code to unlock all nodes with level >= 1 immediately after calling btrfs_search_forward(), change btrfs_search_forward() to do it, so that it benefits all callers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2fad4e83 |
|
23-Jul-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: wake up transaction thread from SYNC_FS ioctl The transaction thread may want to do more work, namely it pokes the cleaner ktread that will start processing uncleaned subvols. This can be triggered by user via the 'btrfs fi sync' command, otherwise there was a delay up to 30 seconds before the cleaner started to clean old snapshots. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ec95d491 |
|
30-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: device delete must be sysloged as in the disk add patch, disk detached from the volume must be recorded in the syslog as well for the same reason. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
43d20761 |
|
30-Jun-2014 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: device add must be sysloged when we add a new disk to the mounted btrfs we don't record it as of now, disk add is a critical change of btrfs configuration, it must be recorded in the syslog to help offline investigations of customer problems when reported. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ed6078f7 |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: use DIV_ROUND_UP instead of open-coded variants The form (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT is equivalent to (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE The rest is a simple subsitution, no difference in the generated assembly code. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
707e8a07 |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: use nodesize everywhere, kill leafsize The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
962a298f |
|
04-Jun-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: kill the key type accessor helpers btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
57cdc8db |
|
04-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: cleanup ino cache members of btrfs_root The naming is confusing, generic yet used for a specific cache. Add a prefix 'ino_' or rename appropriately. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c47ca32d |
|
04-Sep-2014 |
Dan Carpenter <dan.carpenter@oracle.com> |
Btrfs: kfree()ing ERR_PTRs The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them. These kind of bugs are "One Err Bugs" where there is just one error label that does everything. I could set the "inherit = NULL" and keep the single out label but it ends up being more complicated that way. It makes the code simpler to re-order the unwind so it's in the mirror order of the allocation and introduce some new error labels. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e9512d72 |
|
26-Aug-2014 |
Chris Mason <clm@fb.com> |
Btrfs: fix autodefrag with compression The autodefrag code skips defrag when two extents are adjacent. But one big advantage for autodefrag is cutting down on the number of small extents, even when they are adjacent. This commit changes it to defrag all small extents. Signed-off-by: Chris Mason <clm@fb.com>
|
#
62e2390e |
|
07-Aug-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: clone, don't create invalid hole extent map When cloning a file that consists of an inline extent, we were creating an extent map that represents a non-existing trailing hole starting at a file offset that isn't a multiple of the sector size. This happened because when processing an inline extent we weren't aligning the extent's length to the sector size, and therefore incorrectly treating the range [inline_extent_length; sector_size[ as a hole. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
9c3b306e |
|
31-Jul-2014 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: race free update of commit root for ro snapshots This is a better solution for the problem addressed in the following commit: Btrfs: update commit root on snapshot creation after orphan cleanup (3821f348889e506efbd268cc8149e0ebfa47c4e5) The previous solution wasn't the best because of 2 reasons: 1) It added another full transaction commit, which is more expensive than just swapping the commit root with the root; 2) If a reboot happened after the first transaction commit (the one that creates the snapshot) and before the second transaction commit, then we would end up with the same problem if a send using that snapshot was requested before the first transaction commit after the reboot. This change addresses those 2 issues. The second issue is addressed by switching the commit root in the dentry lookup VFS callback, which is also called by the snapshot/subvol creation ioctl and performs orphan cleanup if needed. Like the vfs, the ioctl locks the parent inode too, preventing race issues between a dentry lookup and snapshot creation. Cc: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
14f59796 |
|
29-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: fix use-after-free when cloning a trailing file hole The transaction handle was being used after being freed. Cc: Chris Mason <clm@fb.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3cc79392 |
|
25-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: atomically set inode->i_flags in btrfs_update_iflags This change is based on the corresponding recent change for ext4: ext4: atomically set inode->i_flags in ext4_set_inode_flags() That has the following commit message that applies to btrfs as well: "Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time." Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE and BTRFS_INODE_APPEND, respectively. Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
cc68a8a5 |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: new ioctl TREE_SEARCH_V2 This new ioctl call allows the user to supply a buffer of varying size in which a tree search can store its results. This is much more flexible if you want to receive items which are larger than the current fixed buffer of 3992 bytes or if you want to fetch more items at once. Items larger than this buffer are for example some of the type EXTENT_CSUM. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
ba346b35 |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: tree_search, search_ioctl: direct copy to userspace By copying each found item seperatly to userspace, we do not need extra buffer in the kernel. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
9b6e817d |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW If an item in tree_search is too large to be stored in the given buffer, return the needed size (including the header). Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
8f5f6178 |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer In copy_to_sk, if an item is too large for the given buffer, it now returns -EOVERFLOW instead of copying a search_header with len = 0. For backward compatibility for the first item it still copies such a header to the buffer, but not any other following items, which could have fitted. tree_search changes -EOVERFLOW back to 0 to behave similiar to the way it behaved before this patch. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
12544442 |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: tree_search, search_ioctl: accept varying buffer rewrite search_ioctl to accept a buffer with varying size Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
25c9bc2e |
|
30-Jan-2014 |
Gerhard Heift <gerhard@heift.name> |
btrfs: tree_search: eliminate redundant nr_items check If the amount of items reached the given limit of nr_items, we can leave copy_to_sk without updating the key. Also by returning 1 we leave the loop in search_ioctl without rechecking if we reached the given limit. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>
|
#
7ffbb598 |
|
08-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: make fsync work after cloning into a file When cloning into a file, we were correctly replacing the extent items in the target range and removing the extent maps. However we weren't replacing the extent maps with new ones that point to the new extents - as a consequence, an incremental fsync (when the inode doesn't have the full sync flag) was a NOOP, since it relies on the existence of extent maps in the modified list of the inode's extent map tree, which was empty. Therefore add new extent maps to reflect the target clone range. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
93915584 |
|
04-Jun-2014 |
Antonio Ospite <ao2@ao2.it> |
trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ Signed-off-by: Antonio Ospite <ao2@ao2.it> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com>
|
#
f82a9901 |
|
31-May-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled If the NO_HOLES feature is enabled holes don't have file extent items in the btree that represent them anymore. This made the clone operation ignore the gaps that exist between consecutive file extent items and therefore not create the holes at the destination. When not using the NO_HOLES feature, the holes were created at the destination. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
902c68a4 |
|
28-May-2014 |
Gui Hecheng <guihc.fnst@cn.fujitsu.com> |
btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX To be accurate about the error case, if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3821f348 |
|
02-Jun-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: update commit root on snapshot creation after orphan cleanup On snapshot creation (either writable or read-only), we do orphan cleanup against the root of the snapshot. If the cleanup did remove any orphans, then the current root node will be different from the commit root node until the next transaction commit happens. A send operation always uses the commit root of a snapshot - this means it will see the orphans if it starts computing the send stream before the next transaction commit happens (triggered by a timer or sync() for .e.g), which is when the commit root gets assigned a reference to current root, where the orphans are not visible anymore. The consequence of send seeing the orphans is explained below. For example: mkfs.btrfs -f /dev/sdd mount -o commit=999 /dev/sdd /mnt # open a file with O_TMPFILE and leave it open # write some data to the file btrfs subvolume snapshot -r /mnt /mnt/snap1 btrfs send /mnt/snap1 -f /tmp/send.data The send operation will fail with the following error: ERROR: send ioctl failed with -116: Stale file handle What happens here is that our snapshot has an orphan inode still visible through the commit root, that corresponds to the tmpfile. However send will attempt to call inode.c:btrfs_iget(), with the goal of reading the file's data, which will return -ESTALE because it will use the current root (and not the commit root) of the snapshot. Of course, there are other cases where we can get orphans, but this example using a tmpfile makes it much easier to reproduce the issue. Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if the commit root is different from the current root, just commit the transaction associated with the snapshot's root (if it exists), so that a send will not see any orphans that don't exist anymore. This also guarantees a send will always see the same content regardless of whether a transaction commit happened already before the send was requested and after the orphan cleanup (meaning the commit root and current roots are the same) or it hasn't happened yet (commit and current roots are different). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
ff5df9b8 |
|
30-May-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: ioctl, don't re-lock extent range when not necessary In ioctl.c:lock_extent_range(), after locking our target range, the ordered extent that btrfs_lookup_first_ordered_extent() returns us may not overlap our target range at all. In this case we would just unlock our target range, wait for any new ordered extents that overlap the range to complete, lock again the range and repeat all these steps until we don't get any ordered extent and the delalloc flag isn't set in the io tree for our target range. Therefore just stop if we get an ordered extent that doesn't overlap our target range and the dealalloc flag isn't set for the range in the inode's io tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2c463823 |
|
30-May-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: avoid visiting all extent items when cloning a range When cloning a range of a file, we were visiting all the extent items in the btree that belong to our source inode. We don't need to visit those extent items that don't overlap the range we are cloning, as doing so only makes us waste time and do unnecessary btree navigations (btrfs_next_leaf) for inodes that have a large number of file extent items in the btree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c55bfa67 |
|
24-May-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: set dead flag on the right root when destroying snapshot We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the parent of our target snapshot, instead of setting it in the target snapshot's root. This is easy to observe by running the following scenario: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt btrfs subvolume create /mnt/first_subvol btrfs subvolume snapshot -r /mnt /mnt/mysnap1 btrfs subvolume delete /mnt/first_subvol btrfs subvolume snapshot -r /mnt /mnt/mysnap2 btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data The send command failed because the send ioctl returned -EPERM. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c125b8bf |
|
22-May-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: ensure readers see new data after a clone operation We were cleaning the clone target file range from the page cache before we did replace the file extent items in the fs tree. This was racy, as right after cleaning the relevant range from the page cache and before replacing the file extent items, a read against that range could be performed by another task and populate again the page cache with stale data (stale after the cloning finishes). This would result in reads after the clone operation successfully finishes to get old data (and potentially for a very long time). Therefore evict the pages after replacing the file extent items, so that subsequent reads will always get the new data. Similarly, we were prone to races while cloning the file extent items because we weren't locking the target range and wait for any existing ordered extents against that range to complete. It was possible that after cloning the extent items, a write operation that was performed before the clone operation and overlaps the same range, would end up undoing all or part of the work the clone operation did (a worker task running inode.c:btrfs_finish_ordered_io). Therefore lock the target range in the io tree, wait for all pending ordered extents against that range to finish and then safely perform the cloning. The issue of reading stale data after the clone operation is easy to reproduce by running the following C program in a loop until it exits with return value 1. #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <pthread.h> #include <fcntl.h> #include <assert.h> #include <asm/types.h> #include <linux/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/ioctl.h> #define SRC_FILE "/mnt/sdd/foo" #define DST_FILE "/mnt/sdd/bar" #define FILE_SIZE (16 * 1024) #define PATTERN_SRC 'X' #define PATTERN_DST 'Y' struct btrfs_ioctl_clone_range_args { __s64 src_fd; __u64 src_offset, src_length; __u64 dest_offset; }; #define BTRFS_IOCTL_MAGIC 0x94 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \ struct btrfs_ioctl_clone_range_args) static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; static int clone_done = 0; static int reader_ready = 0; static int stale_data = 0; static void *reader_loop(void *arg) { char buf[4096], want_buf[4096]; memset(want_buf, PATTERN_SRC, 4096); pthread_mutex_lock(&mutex); reader_ready = 1; pthread_mutex_unlock(&mutex); while (1) { int done, fd, ret; fd = open(DST_FILE, O_RDONLY); assert(fd != -1); pthread_mutex_lock(&mutex); done = clone_done; pthread_mutex_unlock(&mutex); ret = read(fd, buf, 4096); assert(ret == 4096); close(fd); if (done) { ret = memcmp(buf, want_buf, 4096); if (ret == 0) { printf("Found new content\n"); } else { printf("Found old content\n"); pthread_mutex_lock(&mutex); stale_data = 1; pthread_mutex_unlock(&mutex); } break; } } return NULL; } int main(int argc, char *argv[]) { pthread_t reader; int ret, i, fd; struct btrfs_ioctl_clone_range_args clone_args; int fd1, fd2; ret = remove(SRC_FILE); if (ret == -1 && errno != ENOENT) { fprintf(stderr, "Error deleting src file: %s\n", strerror(errno)); return 1; } ret = remove(DST_FILE); if (ret == -1 && errno != ENOENT) { fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno)); return 1; } fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU); assert(fd != -1); for (i = 0; i < FILE_SIZE; i++) { char c = PATTERN_SRC; ret = write(fd, &c, 1); assert(ret == 1); } close(fd); fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU); assert(fd != -1); for (i = 0; i < FILE_SIZE; i++) { char c = PATTERN_DST; ret = write(fd, &c, 1); assert(ret == 1); } close(fd); sync(); ret = pthread_create(&reader, NULL, reader_loop, NULL); assert(ret == 0); while (1) { int r; pthread_mutex_lock(&mutex); r = reader_ready; pthread_mutex_unlock(&mutex); if (r) break; } fd1 = open(SRC_FILE, O_RDONLY); if (fd1 < 0) { fprintf(stderr, "Error open src file: %s\n", strerror(errno)); return 1; } fd2 = open(DST_FILE, O_RDWR); if (fd2 < 0) { fprintf(stderr, "Error open dst file: %s\n", strerror(errno)); return 1; } clone_args.src_fd = fd1; clone_args.src_offset = 0; clone_args.src_length = 4096; clone_args.dest_offset = 0; ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args); assert(ret == 0); close(fd1); close(fd2); pthread_mutex_lock(&mutex); clone_done = 1; pthread_mutex_unlock(&mutex); ret = pthread_join(reader, NULL); assert(ret == 0); pthread_mutex_lock(&mutex); ret = stale_data ? 1 : 0; pthread_mutex_unlock(&mutex); return ret; } Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
58dfae63 |
|
13-May-2014 |
ZhangZhen <zhenzhang.zhang@huawei.com> |
btrfs: replace simple_strtoull() with kstrtoull() use the newer and more pleasant kstrtoull() to replace simple_strtoull(), because simple_strtoull() is marked for obsoletion. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
fcebe456 |
|
13-May-2014 |
Josef Bacik <jbacik@fb.com> |
Btrfs: rework qgroup accounting Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
27cdeb70 |
|
02-Apr-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
61155aa0 |
|
15-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: assert that send is not in progres before root deletion CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
521e0546 |
|
15-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: protect snapshots from deleting during send The patch "Btrfs: fix protection between send and root deletion" (18f687d538449373c37c) does not actually prevent to delete the snapshot and just takes care during background cleaning, but this seems rather user unfriendly, this patch implements the idea presented in http://www.spinics.net/lists/linux-btrfs/msg30813.html - add an internal root_item flag to denote a dead root - check if the send_in_progress is set and refuse to delete, otherwise set the flag and proceed - check the flag in send similar to the btrfs_root_readonly checks, for all involved roots The root lookup in send via btrfs_read_fs_root_no_name will check if the root is really dead or not. If it is, ENOENT, aborted send. If it's alive, it's protected by send_in_progress, send can continue. CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e4ef90ff |
|
24-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: make FS_INFO ioctl available to anyone This ioctl provides basic info about the filesystem that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
7d6213c5 |
|
24-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: make DEV_INFO ioctl available to anyone This ioctl provides basic info about the devices that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
80a773fb |
|
07-May-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: retrieve more info from FS_INFO ioctl Provide the basic information about filesystem through the ioctl: * b-tree node size (same as leaf size) * sector size * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments Backward compatibility: if the values are 0, kernel does not provide this information, the applications should ignore them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d3ecfcdf |
|
08-May-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix EIO on reading file after ioctl clone works on it For inline data extent, we need to make its length aligned, otherwise, we can get a phantom extent map which confuses readpages() to return -EIO. This can be detected by xfstests/btrfs/035. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3f9e3df8 |
|
15-Apr-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: replace error code from btrfs_drop_extents There's a case which clone does not handle and used to BUG_ON instead, (testcase xfstests/btrfs/035), now returns EINVAL. This error code is confusing to the ioctl caller, as it normally signifies errorneous arguments. Change it to ENOPNOTSUPP which allows a fall back to copy instead of clone. This does not affect the common reflink operation. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
4e857c58 |
|
17-Mar-2014 |
Peter Zijlstra <peterz@infradead.org> |
arch: Mass conversion of smp_mb__*() Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
36523e95 |
|
07-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: export global block reserve size as space_info Introduce a block group type bit for a global reserve and fill the space info for SPACE_INFO ioctl. This should replace the newly added ioctl (01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part of the global reserve, while the actual usage can be now visible in the 'btrfs fi df' output during ENOSPC stress. The unpatched userspace tools will show the blockgroup as 'unknown'. CC: Jeff Mahoney <jeffm@suse.com> CC: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3a29bc09 |
|
07-Apr-2014 |
Chris Mason <clm@fb.com> |
Btrfs: fix EINVAL checks in btrfs_clone btrfs_drop_extents can now return -EINVAL, but only one caller in btrfs_clone was checking for it. This adds it to the caller for inline extents, which is where we really need it. Signed-off-by: Chris Mason <clm@fb.com>
|
#
9a40f122 |
|
31-Mar-2014 |
Gui Hecheng <guihc.fnst@cn.fujitsu.com> |
btrfs: filter invalid arg for btrfs resize Originally following cmds will work: # btrfs fi resize -10A <mnt> # btrfs fi resize -10Gaha <mnt> Filter the arg by checking the return pointer of memparse. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
84dbeb87 |
|
28-Mar-2014 |
Dan Carpenter <dan.carpenter@oracle.com> |
Btrfs: kmalloc() doesn't return an ERR_PTR The error handling was copy and pasted from memdup_user(). It should be checking for NULL obviously. Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
00fdf13a |
|
10-Mar-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix a crash of clone with inline extents's split xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split of inline extents in __btrfs_drop_extents(). For inline extents, we cannot duplicate another EXTENT_DATA item, because it breaks the rule of inline extents, that is, 'start offset' needs to be 0. We have set limitations for the source inode's compressed inline extents, because it needs to decompress and recompress. Now the destination inode's inline extents also need similar limitations. With this, xfstests btrfs/035 doesn't run into panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
f094c9bd |
|
11-Mar-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: less fs tree lock contention when using autodefrag When finding new extents during an autodefrag, don't do so many fs tree lookups to find an extent with a size smaller then the target treshold. Instead, after each fs tree forward search immediately unlock upper levels and process the entire leaf while holding a read lock on the leaf, since our leaf processing is very fast. This reduces lock contention, allowing for higher concurrency when other tasks want to write/update items related to other inodes in the fs tree, as we're not holding read locks on upper tree levels while processing the leaf and we do less tree searches. Test: sysbench --test=fileio --file-num=512 --file-total-size=16G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \ --max-requests=10000000000 [prepare|run] (fileystem mounted with -o autodefrag, averages of 5 runs) Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
72de6b53 |
|
11-Mar-2014 |
Guangyu Sun <guangyu.sun@oracle.com> |
Btrfs: return EPERM when deleting a default subvolume The error message is confusing: # btrfs sub delete /mnt/mysub/ Delete subvolume '/mnt/mysub' ERROR: cannot delete '/mnt/mysub' - Directory not empty The error message does not make sense to me: It's not about deleting a directory but it's a subvolume, and it doesn't matter if the subvolume is empty or not. Maybe EPERM or is more appropriate in this case, combined with an explanatory kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because it is configured as default subvolume.") Reported-by: Koen De Wit <koen.de.wit@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
308d9800 |
|
11-Mar-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: cache extent states in defrag code path When locking file ranges in the inode's io_tree, cache the first extent state that belongs to the target range, so that when unlocking the range we don't need to search in the io_tree again, reducing cpu time and making and therefore holding the io_tree's lock for a shorter period. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
6c255e67 |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock We needn't flush all delalloc inodes when we doesn't get s_umount lock, or we would make the tasks wait for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
8257b2dc |
|
05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume If the snapshot creation happened after the nocow write but before the dirty data flush, we would fail to flush the dirty data because of no space. So we must keep track of when those nocow write operations start and when they end, if there are nocow writers, the snapshot creators must wait. In order to implement this function, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
e2127cf0 |
|
01-Mar-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: make defrag not fragment files when using prealloc extents When using prealloc extents, a file defragment operation may actually fragment the file and increase the amount of data space used by the file. This change fixes that behaviour. Example: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt $ cd /mnt $ xfs_io -f -c 'falloc 0 1048576' foobar && sync $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync Before defragmenting the file: $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.25MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 0 nr 4096 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 4096 nr 102400 ram 1048576 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 106496 nr 90112 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 196608 nr 106496 ram 1048576 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 897024 nr 106496 ram 1048576 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 1003520 nr 45056 (...) Now defragmenting the file results in more data space used than before: $ btrfs filesystem defragment -f foobar && sync $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.55MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 And the corresponding file extent items are now no longer perfectly sequential as before, and we're now needlessly using more space from data block groups: $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 0 nr 4096 ram 1048576 extent compression 0 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 13893632 nr 102400 extent data offset 0 nr 102400 ram 102400 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 106496 nr 90112 ram 1048576 extent compression 0 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 13996032 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 14102528 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 1003520 nr 45056 ram 1048576 extent compression 0 (...) With this change, the above example will no longer cause allocation of new data space nor change the sequentiality of the file extents, that is, defragment will be effectless, leaving all extent items pointing to the extent starting at disk byte 12845056. In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of 400Mb each, initially consisting of a single prealloc extent of 400Mb, having random writes happening at a low rate, lead to a total of over ~17Gb of data space used, not far from eventually reaching an ENOSPC state. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
dec8ef90 |
|
01-Mar-2014 |
Filipe Manana <fdmanana@gmail.com> |
Btrfs: correctly flush data on defrag when compression is enabled When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression enabled, we weren't flushing completely, as writing compressed extents is a 2 steps process, one to compress the data and another one to write the compressed data to disk. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
abccd00f |
|
30-Jan-2014 |
Hugo Mills <hugo@carfax.org.uk> |
btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit and 64-bit systems. This means that it is impossible to use btrfs receive on a system with a 64-bit kernel and 32-bit userspace, because the structure size (and hence the ioctl number) is different. This patch adds a compatibility structure and ioctl to deal with the above case. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
23ad5b17 |
|
30-Jan-2014 |
Kusanagi Kouichi <slash@ac.auone-net.jp> |
btrfs: Return EXDEV for cross file system snapshot EXDEV seems an appropriate error if an operation fails bacause it crosses file system boundaries. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fb.com>
|
#
11bcac89 |
|
14-Feb-2014 |
Chris Mason <clm@fb.com> |
Revert "btrfs: add ioctl to export size of global metadata reservation" This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d. David Sterba found a different way to provide these features without adding a new ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out for now until we finalize things. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> CC: Jeff Mahoney <jeffm@suse.com>
|
#
8051aa1a |
|
07-Feb-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: reserve no transaction units in btrfs_ioctl_set_features Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d0270aca |
|
07-Feb-2014 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: commit transaction after setting label and features The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c41570c9 |
|
21-Jan-2014 |
Justin Maggard <jmaggard10@gmail.com> |
btrfs: fix defrag 32-bit integer overflow When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
bd60ea0f |
|
16-Jan-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: call permission checks earlier in ioctls and return EPERM The owner and capability checks in IOC_SUBVOL_SETFLAGS and SET_RECEIVED_SUBVOL should be called before any other checks are done. Also unify the error code to EPERM. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
d0242061 |
|
15-Jan-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: restrict snapshotting to own subvolumes Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
e4355f34 |
|
13-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: faster file extent item search in clone ioctl When we are looking for file extent items that intersect the cloning range, for each one that falls completely outside the range, don't release the path and do another full tree search - just move on to the next slot and copy the file extent item into our buffer only if the item intersects the cloning range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
c57c2b3e |
|
11-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: unlock inodes in correct order in clone ioctl In the clone ioctl, when the source and target inodes are different, we can acquire their mutexes in 2 possible different orders. After we're done cloning, we were releasing the mutexes always in the same order - the most correct way of doing it is to release them by the reverse order they were acquired. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
de6e8200 |
|
08-Jan-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: release subvolume's block_rsv before transaction commit We don't have to keep subvolume's block_rsv during transaction commit, and within transaction commit, we may also need the free space reclaimed from this block_rsv to process delayed refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
63541927 |
|
07-Jan-2014 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: add support for inode properties This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: *) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; *) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
eb8052e0 |
|
20-Dec-2013 |
Wenliang Fan <fanwlexca@gmail.com> |
fs/btrfs: Integer overflow in btrfs_ioctl_resize() The local variable 'new_size' comes from userspace. If a large number was passed, there would be an integer overflow in the following line: new_size = old_size + new_size; Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
efe120a0 |
|
20-Dec-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2c686537 |
|
16-Dec-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: Check read-only status of roots during send All the subvolues that are involved in send must be read-only during the whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the status to read-write and the result of send stream is undefined if the data change unexpectedly. Fix that by adding a refcount for all involved roots and verify that there's no send in progress during SUBVOL_SETFLAGS ioctl call that does read-only -> read-write transition. We need refcounts because there are no restrictions on number of send parallel operations currently run on a single subvolume, be it source, parent or one of the multiple clone sources. Kernel is silent when the RO checks fail and returns EPERM. The same set of checks is done already in userspace before send starts. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
5662344b3 |
|
12-Dec-2013 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: fix error check of btrfs_lookup_dentry() Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT) instead. This keeps the return value convention consistent. Callers who use btrfs_lookup_dentry() require a trivial update. create_snapshot() in particular looks like it can also lose a BUG_ON(!inode) which is not really needed - there seems less harm in returning ENOENT to userspace at that point in the stack than there is to crash the machine. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
01e219e8 |
|
01-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add ioctl to export size of global metadata reservation btrfs filesystem df output will show the size of the metadata space and how much of it is used, and the user assumes that the difference is all usable space. Since that's not actually the case due to the global metadata reservation, we should provide the full picture to the user. This patch adds an ioctl that exports the size of the global metadata reservation so that btrfs filesystem df can report it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
3b02a68a |
|
01-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: use feature attribute names to print better error messages Now that we have the feature name strings available in the kernel via the sysfs attributes, we can use them for printing better failure messages from the ioctl path. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
2eaa055f |
|
15-Nov-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: add ioctls to query/change feature bits online There are some feature bits that require no offline setup and can be enabled online. I've only reviewed extended irefs, but there will probably be more. We introduce three new ioctls: - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features. - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs basis, as well as querying for which features are changeable with mounted. - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis. We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that allow us to define which features are safe to change at runtime. The failure modes for BTRFS_IOC_SET_FEATURES are as follows: - Enabling a completely unsupported feature: warns and returns -ENOTSUPP - Enabling a feature that can only be done offline: warns and returns -EPERM Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
|
#
1c1c8747 |
|
11-Dec-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME * don't assume that ->dest_count won't change between copy_from_user() and memdup_user() * use fdget instead of fget * don't bother comparing superblocks when we'd already compared vfsmounts * get rid of excessive goto * use file_inode() instead of open-coding the sucker Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e43f998e |
|
06-Dec-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: call mnt_drop_write after interrupted subvol deletion If btrfs_ioctl_snap_destroy blocks on the mutex and the process is killed, mnt_write count is unbalanced and leads to unmountable filesystem. CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
|
#
54563d41 |
|
01-Sep-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: get rid of fdentry() 3 of 4 callers actually want file_inode()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
46e0f66a |
|
14-Nov-2013 |
Chris Mason <chris.mason@fusionio.com> |
btrfs: fix empty_zero_page misusage Heiko Carstens noticed that btrfs was using empty_zero_page incorrectly. He explained: The definition of empty_zero_page is architecture specific. It is (currently) either a character array, an unsigned long containing the address of the empty_zero_page, or even worse only the address of the struct page belonging to the empty_zero_page. This commit changes btrfs to use a for-loop instead. On x86 the resulting .ko is smaller, and we're no longer worrying about how each arch builds its zeros. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
91aef86f |
|
04-Nov-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: rename btrfs_start_all_delalloc_inodes rename the function -- btrfs_start_all_delalloc_inodes(), and make its name be compatible to btrfs_wait_ordered_roots(), since they are always used at the same place. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b0244199 |
|
04-Nov-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't wait for the completion of all the ordered extents It is very likely that there are lots of ordered extents in the filesytem, if we wait for the completion of all of them when we want to reclaim some space for the metadata space reservation, we would be blocked for a long time. The performance would drop down suddenly for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
67871254 |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Fix checkpatch.pl warning of spacing issues Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d9b0d9ba |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Replace kmalloc with kmalloc_array Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making it easier to check is that the calculation doesn't wrap or return a smaller allocation Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b19e6843 |
|
30-Oct-2013 |
Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> |
btrfs: Remove redundant local zero structure Remove redundant local zero structure, replacing it by the kernel's global ZERO_PAGE. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
8b558c5f |
|
16-Oct-2013 |
Zach Brown <zab@redhat.com> |
btrfs: remove fs/btrfs/compat.h fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink() and inc_nlink(). This doesn't belong in mainline. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
229eed43 |
|
13-Oct-2013 |
Geyslan G. Bem <geyslan@gmail.com> |
btrfs: simplify kmalloc+copy_from_user to memdup_user Use memdup_user rather than duplicating its implementation This is a little bit restricted to reduce false positives The semantic patch that makes this report is available in scripts/coccinelle/api/memdup_user.cocci. More information about semantic patching is available at http://coccinelle.lip6.fr/ Signed-off-by: Geyslan G. Bem <geyslan@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
adfa97cb |
|
10-Oct-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: don't leak ioctl args in btrfs_ioctl_dev_replace struct btrfs_ioctl_dev_replace_args memory is leaked if replace is requested on a read-only filesystem. Fix it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6174d3cb |
|
01-Oct-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: remove unused max_key arg from btrfs_search_forward It is not used for anything. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0a4e5586 |
|
24-Sep-2013 |
Ross Kirk <ross.kirk@gmail.com> |
btrfs: remove unused parameter from btrfs_header_fsid Remove unused parameter, 'eb'. Unused since introduction in 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9b199859 |
|
23-Sep-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix sync fs to actually wait for all data to be persisted Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't wait for delayed work to finish before returning success to the caller. This change fixes this, ensuring that there's no data loss if a power failure happens right after fs sync returns success to the caller and before the next commit happens. Steps to reproduce the data loss issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs Right after the btrfs fi sync command (a second or 2 for example), power off the machine and reboot it. The file will be empty, as it can be verified after mounting the filesystem and through btrfs-debug-tree: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29429760 items 0 free space 3995 generation 7 owner 7 fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae uuid tree key (UUID_TREE ROOT_ITEM 0) After this patch, the data loss no longer happens after a power failure and btrfs-debug-tree shows: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53 extent data disk byte 12845056 nr 8192 extent data offset 0 nr 8192 ram 8192 extent compression 0 checksum tree key (CSUM_TREE ROOT_ITEM 0) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
cbf8b8ca |
|
17-Sep-2013 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: change extent-same to copy entire argument struct btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data back to it's argument struct. Unfortunately, not all architectures provide __put_user_unaligned(), so compiles break on them if btrfs is selected. Instead, just copy the whole struct in / out at the start and end of operations, respectively. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
1cecf579 |
|
13-Sep-2013 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default subvolume by passing a subvolume id of 0. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f0de181c |
|
17-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: kill delay_iput arg to the wait_ordered functions This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e57138b3 |
|
20-Aug-2013 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: return btrfs error code for dev excl ops err now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS as defined in btrfs.h for the dev excl operation error in the FS, which means with this kernel would stop logging (almost an user error) into the /var/log/messages v2: accepts Josef' comment Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
f7171750 |
|
12-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl The handler for the ioctl BTRFS_IOC_FS_INFO was reading the number of devices before acquiring the device list mutex. This could lead to inconsistent results because the update of the device list and the number of devices counter (amongst other counters related to the device list) are updated in volumes.c while holding the device list mutex - except for 2 places, one was volumes.c:btrfs_prepare_sprout() and the other was volumes.c:device_list_add(). For example, if we have 2 devices, with IDs 1 and 2 and then add a new device, with ID 3, and while adding the device is in progress an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of devices of 2 and a max dev id of 3. This would be incorrect. Also, this ioctl handler was reading the fsid while it can be updated concurrently. This can happen when while a new device is being added and the current filesystem is in seeding mode. Example: $ mkfs.btrfs -f /dev/sdb1 $ mkfs.btrfs -f /dev/sdb2 $ btrfstune -S 1 /dev/sdb1 $ mount /dev/sdb1 /mnt/test $ btrfs device add /dev/sdb2 /mnt/test If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it could read an fsid that was never valid (some bits part of the old fsid and others part of the new fsid). Also, it could read a number of devices that doesn't match the number of devices in the list and the max device id, as explained before. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b308bc2f |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
fba6aa75 |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Make btrfs_header_fsid() return unsigned long Internally, btrfs_header_fsid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
c1c9ff7c |
|
20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Remove superfluous casts from u64 to unsigned long long u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
bbb651e4 |
|
19-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: don't allow the replace procedure on read only filesystems If you start the replace procedure on a read only filesystem, at the end the procedure fails to write the updated dev_items to the chunk tree. The problem is that this error is not indicated except for a WARN_ON(). If the user now thinks that everything was done as expected and destroys the source device (with mkfs or with a hammer). The next mount fails with "failed to read chunk root" and the filesystem is gone. This commit adds code to fail the attempt to start the replace procedure if the filesystem is mounted read-only. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
633085c7 |
|
16-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: reset force_compress on btrfs_file_defrag failure After we set force_compress with a new value (which was not being done while holding the inode mutex), if an error happens and we jump to the label out_ra, the force_compress property of the inode is not set to BTRFS_COMPRESS_NONE (unlike in the case where no errors happen). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
dd5f9615 |
|
15-Aug-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: maintain subvolume items in the UUID tree When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume. User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
18674c6c |
|
13-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP If the inode ref key was not found and the current leaf slot was 0 (first item in the leaf) the code would always return -ENOENT. This was not correct because the desired inode ref item might be the last item in the previous leaf. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a696cf35 |
|
13-Aug-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler If the path doesn't fit in the input buffer, return ENAMETOOLONG instead of returning with a success code (0) and a partially filled and right justified buffer. Also removed useless buffer pointer check outside the while loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
175a2b87 |
|
12-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: don't allow a subvol to be deleted if it is the default subovl Eric pointed out that btrfs will happily allow you to delete the default subvol. This is a problem obviously since the next time you go to mount the file system it will freak out because it can't find the root. Fix this by adding a check to see if our default subvol points to the subvol we are trying to delete, and if it does not allowing it to happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
416161db |
|
06-Aug-2013 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: offline dedupe This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to de-duplicate a list of extents across a range of files. Internally, the ioctl re-uses code from the clone ioctl. This avoids rewriting a large chunk of extent handling code. Userspace passes in an array of file, offset pairs along with a length argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison of the user data before deduping the extent. Status and number of bytes deduped are returned for each operation. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
32b7c687 |
|
06-Aug-2013 |
Mark Fasheh <mfasheh@suse.de> |
btrfs_ioctl_clone: Move clone code into it's own function There's some 250+ lines here that are easily encapsulated into their own function. I don't change how anything works here, just create and document the new btrfs_clone() function from btrfs_ioctl_clone() code. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
77fe20dc |
|
06-Aug-2013 |
Mark Fasheh <mfasheh@suse.de> |
btrfs: abtract out range locking in clone ioctl() The range locking in btrfs_ioctl_clone is trivially broken out into it's own function. This reduces the complexity of btrfs_ioctl_clone() by a small bit and makes that locking code available to future functions in fs/btrfs/ioctl.c Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
a1b83ac5 |
|
19-Jul-2013 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: fix get set label blocking against balance btrfs_ioctl_get_fslabel() and btrfs_ioctl_set_fslabel() used root->fs_info->volume_mutex mutex which caused operations like balance to block set/get label operation until its completion and generally balance operation takes a long time to complete, so it will be annoying to the user when cli appears hung also this patch will add a bit of optimization within the btrfs_ioctl_get_falabel() function. v1->v2: use fs_info->super_lock instead of uuid_mutex Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3cae210f |
|
15-Jul-2013 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert Some codes still use the cpu_to_lexx instead of the BTRFS_SETGET_STACK_FUNCS declared in ctree.h. Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec and other structures. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ee3441b4 |
|
09-Jul-2013 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: fall back to global reservation when removing subvolumes I recently did some ENOSPC testing that involved filling the disk while create and removing snapshots in a loop. During the test cycle, I ran into an ENOSPC when trying to remove a snapshot, leaving the fs stuck in ENOSPC even after a umount/mount cycle. This patch allow subvolume removal to fall back onto the global block reservation in order to succeed when it would have failed otherwise. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6d0379ec |
|
16-Jun-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: more open-coded file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a96fbc72 |
|
26-May-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: allow file data clone within a file We did not allow file data clone within the same file because of deadlock issues. However, we now use nested lock to avoid deadlock between the parent directory and the child file. So it's safe to do file clone within the same file when the two ranges are not overlapped. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
183860f6 |
|
17-May-2013 |
Anand Jain <Anand.Jain@oracle.com> |
btrfs: device delete to get errors from the kernel when user runs command btrfs dev del the raid requisite error if any goes to the /var/log/messages, its not good idea to clutter messages with these user (knowledge) errors, further user don't have to review the system messages to know problem with the cli it should be dropped to the user as part of the cli return. to bring this feature created a set of the ERROR defined BTRFS_ERROR_DEV* error codes and created their error string. I expect this enum to be added with other error which we might want to communicate to the user land v3: moved the code with in the file no logical change v1->v2: introduce error codes for the device mgmt usage v1: adds a parameter in the ioctl arg struct to carry the error string Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
6a03843d |
|
15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Before applying this patch, we need flush all the delalloc inodes in the fs when we want to create a snapshot, it wastes time, and make the transaction commit be blocked for a long time. It means some other user operation would also be blocked for a long time. This patch improves this problem, we just flush the delalloc inodes that in the source trees before snapshot creation, so the transaction commit will complete quickly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
3c64a1ab |
|
13-May-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: cleanup: don't check the same thing twice btrfs_read_fs_root_no_name() already checks if btrfs_root_refs() is zero and returns ENOENT in this case. There is no need to do it again in six places. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5798b92d |
|
13-May-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: remove useless copy in quota_ctl We don't need to copy it back to user side as it remains unchanged. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
57254b6e |
|
06-May-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: add ioctl to wait for qgroup rescan completion btrfs_qgroup_wait_for_completion waits until the currently running qgroup operation completes. It returns immediately when no rescan process is in progress. This is useful to automate things around the rescan process (e.g. testing). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
03b71c6c |
|
06-May-2013 |
Gabriel de Perthuis <g2p.code@gmail.com> |
btrfs: don't stop searching after encountering the wrong item The search ioctl skips items that are too large for a result buffer, but inline items of a certain size occuring before any search result is found would trigger an overflow and stop the search entirely. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641 Cc: stable@vger.kernel.org Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
55793c0d |
|
26-Apr-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: read entire device info under lock There's a theoretical possibility of reading stale (or even more theoretically, freed) data from DEV_INFO ioctl when the device would disappear between an early mutex unlock and data being copied from the device structure. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
48a3b636 |
|
25-Apr-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: make static code static & remove dead code Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2f232036 |
|
25-Apr-2013 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: rescan for qgroups If qgroup tracking is out of sync, a rescan operation can be started. It iterates the complete extent tree and recalculates all qgroup tracking data. This is an expensive operation and should not be used unless required. A filesystem under rescan can still be umounted. The rescan continues on the next mount. Status information is provided with a separate ioctl while a rescan operation is in progress. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
0abd5b17 |
|
16-Apr-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: return error when we specify wrong start to defrag We need such a sanity check for wrong start when we defrag a file, otherwise, even with a wrong start that's larger than file size, we can end up changing not only inode's force compress flag but also FS's incompat flags. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
92f183aa |
|
08-Apr-2013 |
Wang Shilong <wangsl-fnst@cn.fujitsu.com> |
Btrfs: use tree_root to avoid edquot when disabling quota Steps to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs sub create <mnt>/subv btrfs qgroup limit 10K <mnt>/subv btrfs quota disable <mnt>/subv It is wrong for qgroup to reserve when disabling quota, so just use tree_root to avoid edquot when disabling quota. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
7708f029 |
|
07-Apr-2013 |
Wang Shilong <wangsl-fnst@cn.fujitsu.com> |
Btrfs: creating the subvolume qgroup automatically when enabling quota Creating the subvolume/snapshots(including root subvolume) qgroup auotomatically when enabling quota. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5c50c9b8 |
|
22-Mar-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: make subvol creation/deletion killable in the early stages The subvolume ioctls block on the parent directory mutex that can be held by other concurrent snapshot activity for a long time. Give the user at least some chance to get out of this situation by allowing to send a kill signal. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
9b53157a |
|
04-Mar-2013 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: allow running defrag in parallel to administrative tasks Commit 5ac00add added a testnset mutex and code that disallows running administrative tasks in parallel. It is prevented that the device add/delete/balance/replace/resize operations are started in parallel. By mistake, the defragmentation operation was included in the check for mutually exclusiveness as well. This is fixed with this commit. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
00d71c9c |
|
04-Mar-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix unclosed transaction handler when the async transaction commitment fails If the async transaction commitment failed, we need close the current transaction handler, or the current transaction will be blocked to commit because of this orphan handler. We fix the problem by doing sync transaction commitment, that is to invoke btrfs_commit_transaction(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
aec8030a |
|
04-Mar-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails There are several bugs at error path of create_snapshot() when the transaction commitment failed. - access the freed transaction handler. At the end of the transaction commitment, the transaction handler was freed, so we should not access it after the transaction commitment. - we were not aware of the error which happened during the snapshot creation if we submitted a async transaction commitment. - pending snapshot access vs pending snapshot free. when something wrong happened after we submitted a async transaction commitment, the transaction committer would cleanup the pending snapshots and free them. But the snapshot creators were not aware of it, they would access the freed pending snapshots. This patch fixes the above problems by: - remove the dangerous code that accessed the freed handler - assign ->error if the error happens during the snapshot creation - the transaction committer doesn't free the pending snapshots, just assigns the error number and evicts them before we unblock the transaction. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
c58aaad2 |
|
28-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume When deleting a snapshot/subvolume, we need remove root ref/backref, dir entries and update the dir inode, so we must reserve free space for those operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d5c12070 |
|
28-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong reserved space in qgroup during snap/subv creation There are two problems in the space reservation of the snapshot/ subvolume creation. - don't reserve the space for the root item insertion - the space which is reserved in the qgroup is different with the free space reservation. we need reserve free space for 7 items, but in qgroup reservation, we need reserve space only for 3 items. So we implement new metadata reservation functions for the snapshot/subvolume creation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e9662f70 |
|
28-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot Since we have grabbed the parent inode at the beginning of the snapshot creation, and both sync and async snapshot creation release it after the pending snapshots are actually created, it is safe to access the parent inode directly during the snapshot creation, we needn't use dget_parent/dput to fix the parent dentry and get the dir inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
496ad9aa |
|
23-Jan-2013 |
Al Viro <viro@zeniv.linux.org.uk> |
new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
fa6ac876 |
|
20-Feb-2013 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix cleaner thread not working with inode cache option Right now inode cache inode is treated as the same as space cache inode, ie. keep inode in memory till putting super. But this leads to an awkward situation. If we're going to delete a snapshot/subvolume, btrfs will not actually delete it and return free space, but will add it to dead roots list until the last inode on this snap/subvol being destroyed. Then we'll fetch deleted roots and cleanup them via cleaner thread. So here is the problem, if we enable inode cache option, each snap/subvol has a cached inode which is used to store inode allcation information. And this cache inode will be kept in memory, as the above said. So with inode cache, snap/subvol can only be added into dead roots list during freeing roots stage in umount, so that we can ONLY get space back after another remount(we cleanup dead roots on mount). But the real thing is we'll no more use the snap/subvol if we mark it deleted, so we can safely iput its cache inode when we delete snap/subvol. Another thing is that we need to change the rules of droping inode, we don't keep snap/subvol's cache inode in memory till end so that we can add snap/subvol into dead roots list in time. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
d4edf39b |
|
20-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix uncompleted transaction In some cases, we need commit the current transaction, but don't want to start a new one if there is no running transaction, so we introduce the function - btrfs_attach_transaction(), which can catch the current transaction, and return -ENOENT if there is no running transaction. But no running transaction doesn't mean the current transction completely, because we removed the running transaction before it completes. In some cases, it doesn't matter. But in some special cases, such as freeze fs, we hope the transaction is fully on disk, it will introduce some bugs, for example, we may feeze the fs and dump the data in the disk, if the transction doesn't complete, we would dump inconsistent data. So we need fix the above problem for those cases. We fixes this problem by introducing a function: btrfs_attach_transaction_barrier() if we hope all the transaction is fully on the disk, even they are not running, we can use this function. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
a8bfd4ab |
|
04-Jan-2013 |
jeff.liu <jeff.liu@oracle.com> |
Btrfs: set/change the label of a mounted file system With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
867ab667 |
|
04-Jan-2013 |
jeff.liu <jeff.liu@oracle.com> |
Btrfs: Add a new ioctl to get the label of a mounted file system Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Goffredo Baroncelli <kreijack@inwind.it> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
210549eb |
|
09-Feb-2013 |
David Sterba <dsterba@suse.cz> |
btrfs: add cancellation points to defrag The defrag operation can take very long, we want to have a way how to cancel it. The code checks for a pending signal at safe points in the defrag loops and returns EAGAIN. This means a user can press ^C after running 'btrfs fi defrag', woks for both defrag modes, files and root. Returning from the command was instant in my light tests, but may take longer depending on the aging factor of the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
8696c533 |
|
06-Feb-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix memory leak of pending_snapshot->inherit The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned to NULL during we created the snapshots, so we didn't free it though we called kfree() in the caller. But since we are sure the snapshot creation is done after the function - btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't assign the pointer "inherit" to NULL, and just free it in the caller of btrfs_ioctl_snap_create_transid(). In this way, the code can become more readable. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
de78b51a |
|
31-Jan-2013 |
Eric Sandeen <sandeen@redhat.com> |
btrfs: remove cache only arguments from defrag path The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
55e301fd |
|
28-Jan-2013 |
Filipe Brandenburger <filbranden@google.com> |
Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h The header file will then be installed under /usr/include/linux so that userspace applications can refer to Btrfs ioctls by name and use the same structs used internally in the kernel. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
82b22ac8 |
|
28-Jan-2013 |
Kusanagi Kouichi <slash@ac.auone-net.jp> |
Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS CAP_DAC_READ_SEARCH overrides read and search permission check on file and directory. It seems fit for BTRFS_IOC_INO_PATHS. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
dfd79829 |
|
21-Dec-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix trivial error in btrfs_ioctl_resize() This patch fixes the following problem: - improper return value - unnecessary read-only check Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
1a65e24b |
|
05-Feb-2013 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: move d_instantiate outside the transaction during mksubvol Dave Sterba triggered a lockdep complaint about lock ordering between the sb_internal lock and the cleaner semaphore. btrfs_lookup_dentry() checks for orphans if we're looking up the inode for a subvolume, and subvolume creation is triggering the lookup with a transaction running. This commit moves the d_instantiate after the transaction closes. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
25122d15 |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Operation-specific check (whether subvol is readonly or not) should go after the mutual exclusiveness check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
4ac20c70 |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix unlock order in btrfs_ioctl_rm_dev Fix unlock order in btrfs_ioctl_rm_dev(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
18f39c41 |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix unlock order in btrfs_ioctl_resize Fix unlock order in btrfs_ioctl_resize(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
2c0c9da0 |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: fix "mutually exclusive op is running" error code The error code that is returned in response to starting a mutually exclusive operation when there is one already running got silently changed from EINVAL to EINPROGRESS by 5ac00add. Returning EINPROGRESS to, say, add_dev, when rm_dev is running is misleading. Furthermore, the operation itself may want to use EINPROGRESS for other purposes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
ed0fb78f |
|
20-Jan-2013 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: bring back balance pause/resume logic Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1 as part of dev-replace merge). Offending commit took a stab at making mutually exclusive volume operations (add_dev, rm_dev, resize, balance, replace_dev) not block behind volume_mutex if another such operation is in progress and instead return an error right away. Balancing front-end relied on the blocking behaviour, so the fix is ugly, but short of a complete rework, it's the best we can do. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
97547676 |
|
21-Dec-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix missing write access release in btrfs_ioctl_resize() We forget to give up the write access after we find some device operation is going on. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
dba60f3f |
|
21-Dec-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix resize a readonly device We should not resize a readonly device, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5c39da5b |
|
22-Oct-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: do not delete a subvolume which is in a R/O subvolume Step to reproduce: # mkfs.btrfs <disk> # mount <disk> <mnt> # btrfs sub create <mnt>/subv0 # btrfs sub snap <mnt> <mnt>/subv0/snap0 # change <mnt>/subv0 from R/W to R/O # btrfs sub del <mnt>/subv0/snap0 We deleted the snapshot successfully. I think we should not be able to delete the snapshot since the parent subvolume is R/O. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
d86e56cf |
|
15-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: disable qgroup id 0 Qgroup id 0 is a special number, we should set the id of a qgroup to 0. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
213490b3 |
|
11-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix a bug of per-file nocow Users report a bug, the reproducer is: $ mkfs.btrfs /dev/loop0 $ mount /dev/loop0 /mnt/btrfs/ $ mkdir /mnt/btrfs/dir $ chattr +C /mnt/btrfs/dir/ $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10; $ lsattr /mnt/btrfs/dir/foo ---------------C- /mnt/btrfs/dir/foo $ filefrag /mnt/btrfs/dir/foo /mnt/btrfs/dir/foo: 1 extent found ---> an extent $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync $ filefrag /mnt/btrfs/dir/foo /mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts The new created file should not only inherit the NODATACOW flag, but also honor NODATASUM flag, because we must do COW on a file extent with checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9c52057c |
|
17-Dec-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: fix hash overflow handling The handling for directory crc hash overflows was fairly obscure, split_leaf returns EOVERFLOW when we try to extend the item and that is supposed to bubble up to userland. For a while it did so, but along the way we added better handling of errors and forced the FS readonly if we hit IO errors during the directory insertion. Along the way, we started testing only for EEXIST and the EOVERFLOW case was dropped. The end result is that we may force the FS readonly if we catch a directory hash bucket overflow. This fixes a few problem spots. First I add tests for EOVERFLOW in the places where we can safely just return the error up the chain. btrfs_rename is harder though, because it tries to insert the new directory item only after it has already unlinked anything the rename was going to overwrite. Rather than adding very complex logic, I added a helper to test for the hash overflow case early while it is still safe to bail out. Snapshot and subvolume creation had a similar problem, so they are using the new helper now too. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Pascal Junod <pascal@junod.info>
|
#
905b0dda |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: get write access for qgroup operations We need get write access for qgroup operations, or we will modify the R/O fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
b8e95489 |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: get write access for scrub We need get write access for scrub, or we will modify the R/O fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
da24927b |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: get write access when removing a device Steps to reproduce: # mkfs.btrfs -d single -m single <disk0> <disk1> # mount -o ro <disk0> <mnt0> # mount -o ro <disk0> <mnt1> # mount -o remount,rw <mnt0> # umount <mnt0> # btrfs device delete <disk1> <mnt1> We can remove a device from a R/O filesystem. The reason is that we just check the R/O flag of the super block object. It is not enough, because the kernel may set the R/O flag only for the mount point. We need invoke mnt_want_write_file() to do a full check. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
198605a8 |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: get write access when doing resize fs Steps to reproduce: # mkfs.btrfs <partition> # mount -o ro <partition> <mnt0> # mount -o ro <partition> <mnt1> # mount -o remount,rw <mnt0> # umount <mnt0> # btrfs fi resize 10g <mnt1> We re-sized a R/O filesystem. The reason is that we just check the R/O flag of the super block object. It is not enough, because the kernel may set the R/O flag only for the mount point. We need invoke mnt_want_write_file() to do a full check. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3c04ce01 |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: get write access when setting the default subvolume When wen want to set the default subvolume, we must get write access, or we will change the R/O file system. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
ff7c1d33 |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't start a new transaction when starting sync If there is no running transaction in the fs, we needn't start a new one when we want to start sync. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
9a8c28be |
|
26-Nov-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: pass root object into btrfs_ioctl_{start, wait}_sync() Since we have gotten the root in the caller, just pass it into btrfs_ioctl_{start, wait}_sync() directly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
3f6bcfbd |
|
06-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add support for device replace ioctls This is the commit that allows to start the device replace procedure. An ioctl() interface is added that supports starting and canceling the device replace procedure, and to retrieve the status and progress. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
63a212ab |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow some operations on the device replace target device This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
5ac00add |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: disallow mutually exclusive admin operations from user mode Btrfs admin operations that are manually started from user mode and that cannot be executed at the same time return -EINPROGRESS. A common way to enter and leave this locked section is introduced since it used to be specific to the balance operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
aa1b8cd4 |
|
05-Nov-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: pass fs_info instead of root A small number of functions that are used in a device replace procedure when the operation is resumed at mount time are unable to pass the same root pointer that would be used in the regular (ioctl) context. And since the root pointer is not required, only the fs_info is, the root pointer argument is replaced with the fs_info pointer argument. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
109f2365 |
|
04-Nov-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix a double free on pending snapshots in error handling When creating a snapshot, failing to commit a transaction can end up with aborting the transaction, following by doing a cleanup for it, where we'll free all snapshots pending to disk. So we check it and avoid double free on pending snapshots. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
0253f40e |
|
26-Oct-2012 |
jeff.liu <jeff.liu@oracle.com> |
Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev() Remove an invalid size check up from btrfs_shrink_dev(). The new size should not larger than the device->total_bytes as it was already verified before coming to here(i.e. new_size < old_size). Remove invalid check up for btrfs_shrink_dev(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
d0e1d66b |
|
11-Dec-2012 |
Namjae Jeon <linkinjeon@gmail.com> |
writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr() There is no reason to pass the nr_pages_dirtied argument, because nr_pages_dirtied value from the caller is unused in balance_dirty_pages_ratelimited_nr(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c37b2b62 |
|
22-Oct-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: do not bug when we fail to commit the transaction We BUG if we fail to commit the transaction when creating a snapshot, which is just obnoxious. Remove the BUG_ON(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
e515c18b |
|
16-Oct-2012 |
Lukas Czerner <lczerner@redhat.com> |
btrfs: Return EINVAL when length to trim is less than FSB Currently if len argument in btrfs_ioctl_fitrim() is smaller than one FSB we will continue and finally return 0 bytes discarded. However if the length to discard is smaller then file system block we should really return EINVAL. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
#
4fa6b5ec |
|
10-Oct-2012 |
Jeff Layton <jlayton@kernel.org> |
audit: overhaul __audit_inode_child to accomodate retrying In order to accomodate retrying path-based syscalls, we need to add a new "type" argument to audit_inode_child. This will tell us whether we're looking for a child entry that represents a create or a delete. If we find a parent, don't automatically assume that we need to create a new entry. Instead, use the information we have to try to find an existing entry first. Update it if one is found and create a new one if not. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
c43a25ab |
|
10-Oct-2012 |
Jeff Layton <jlayton@kernel.org> |
audit: reverse arguments to audit_inode_child Most of the callers get called with an inode and dentry in the reverse order. The compiler then has to reshuffle the arg registers and/or stack in order to pass them on to audit_inode_child. Reverse those arguments for a micro-optimization. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
5af3e8cc |
|
01-Aug-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: make filesystem read-only when submitting barrier fails So far the return code of barrier_all_devices() is ignored, which means that errors are ignored. The result can be a corrupt filesystem which is not consistent. This commit adds code to evaluate the return code of barrier_all_devices(). The normal btrfs_error() mechanism is used to switch the filesystem into read-only mode when errors are detected. In order to decide whether barrier_all_devices() should return error or success, the number of disks that are allowed to fail the barrier submission is calculated. This calculation accounts for the worst RAID level of metadata, system and data. If single, dup or RAID0 is in use, a single disk error is already considered to be fatal. Otherwise a single disk error is tolerated. The calculation of the number of disks that are tolerated to fail the barrier operation is performed when the filesystem gets mounted, when a balance operation is started and finished, and when devices are added or removed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
aa42ffd9 |
|
18-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix off-by-one in file clone Btrfs uses inclusive range end for lock_extent(), unlock_extent() and related functions, so we made off-by-one errors in file clone. This fixes it and also fixes some style problems. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
7e97b8da |
|
07-Sep-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: allow setting NOCOW for a zero sized file via ioctl Hi, the patch si simple, but it has user visible impact and I'm not quite sure how to resolve it. In short, $subj says it, chattr -C supports it and we want to use it. The conditions that acutally allow to change the NOCOW flag are clear. What if I try to set the flag on a file that is not empty? Options: 1) whole ioctl will fail, EINVAL 2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file will stay COW-ed and checksummed 2.2) ioctl will succeed, flag will not be removed and a syslog message will warn that the COW flag has not been changed 2.2.1) dtto, no syslog message Man page of chattr states that "If it is set on a file which already has data blocks, it is undefined when the blocks assigned to the file will be fully stable." Yes, it's undefined and with current implementation it'll never happen. So from this end, the user cannot expect anything. I'm trying to find a reasonable behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of flags in a deep directory does not fail unnecessarily and does not pollute the log. My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting the fact that I know the code and otherwise would look there before consulting the documentation. The patch implements 2.2.1. david -------------8<------------------- From: David Sterba <dsterba@suse.cz> It's safe to turn off checksums for a zero sized file. http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030 "We cannot switch on NODATASUM for a file that already has extents that are checksummed. The invariant here is that either all the extents or none are checksummed. Theoretically it's possible to add/remove all checksums from a given file, but it's a potentially longtime operation, the file has to be in some intermediate state where the checksums partially exist but have to be ignored (for the csum->nocsum) until the file is fully converted, this brings more special cases to extent handling, it has to survive power failure and remain consistent, and probably needs to be restarted after next mount." Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
425d17a2 |
|
07-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: use larger limit for translation of logical to inode This is the change of the kernel side. Translation of logical to inode used to have an upper limit 4k on inode container's size, but the limit is not large enough for a data with a great many of refs, so when resolving logical address, we can end up with "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493" This changes to regard 64k as the upper limit and use vmalloc instead of kmalloc to get memory more easily. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
df031f07 |
|
07-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: use helper for logical resolve We already have a helper, iterate_inodes_from_logical(), for logical resolve, so just use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
69917e43 |
|
07-Sep-2012 |
Liu Bo <liub.liubo@gmail.com> |
Btrfs: fix a bug in parsing return value in logical resolve In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag. It is possible to lose our errors because (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true. I'm not sure if it is on purpose, it just looks too hacky if it is. I'd rather use a real flag and a 'ret' to catch errors. Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Liu Bo <liub.liubo@gmail.com>
|
#
9e8a4a8b |
|
05-Sep-2012 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag We're going to use this flag EXTENT_DEFRAG to indicate which range belongs to defragment so that we can implement snapshow-aware defrag: We set the EXTENT_DEFRAG flag when dirtying the extents that need defragmented, so later on writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
#
48c03c4b |
|
06-Sep-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix wrong size for the reservation of the, snapshot creation We should insert/update 6 items(root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
66d8f3dd |
|
06-Sep-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
#
2671485d |
|
28-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: remove unused hint byte argument for btrfs_drop_extents I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
5dc562c5 |
|
17-Aug-2012 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
#
2903ff01 |
|
27-Aug-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch simple cases of fget_light to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8319aa91 |
|
27-Aug-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch btrfs_ioctl_clone() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ecd18815 |
|
26-Aug-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch btrfs_ioctl_snap_create_transid() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2f2f43d3 |
|
10-Feb-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
userns: Convert btrfs to use kuid/kgid where appropriate Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
#
dadd1105 |
|
30-Jul-2012 |
Dan Carpenter <dan.carpenter@oracle.com> |
Btrfs: fix some endian bugs handling the root times "trans->transid" is cpu endian but we want to store the data as little endian. "item->ctime.nsec" is only 32 bits, not 64. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
|
#
e00da206 |
|
02-Aug-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: remove mnt_want_write call in btrfs_mksubvol We got a recursive lock in mksubvol because the caller already held a lock. I think we got into this due to a merge error. Commit a874a63 removed the mnt_want_write call from btrfs_mksubvol and added a replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid. Commit e7848683 however tried to move all calls to mnt_want_write above i_mutex. So somewhere while merging this, it got mixed up. The solution is to remove the mnt_want_write call completely from mksubvol. Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
e7848683 |
|
12-Jun-2012 |
Jan Kara <jack@suse.cz> |
btrfs: Push mnt_want_write() outside of i_mutex When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason <chris.mason@oracle.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
31db9f7c |
|
25-Jul-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive This patch introduces the BTRFS_IOC_SEND ioctl that is required for send. It allows btrfs-progs to implement full and incremental sends. Patches for btrfs-progs will follow. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
|
#
8ea05e3a |
|
25-Jul-2012 |
Alexander Block <ablock84@googlemail.com> |
Btrfs: introduce subvol uuids and times This patch introduces uuids for subvolumes. Each subvolume has it's own uuid. In case it was snapshotted, it also contains parent_uuid. In case it was received, it also contains received_uuid. It also introduces subvolume ctime/otime/stime/rtime. The first two are comparable to the times found in inodes. otime is the origin/creation time and ctime is the change time. stime/rtime are only valid on received subvolumes. stime is the time of the subvolume when it was sent. rtime is the time of the subvolume when it was received. Additionally to the times, we have a transid for each time. They are updated at the same place as the times. btrfs receive uses stransid and rtransid to find out if a received subvolume changed in the meantime. If an older kernel mounts a filesystem with the extented fields, all fields become invalid. The next mount with a new kernel will detect this and reset the fields. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
|
#
2b0ce2c2 |
|
24-Jul-2012 |
Mitch Harder <mitch.harder@sabayonlinux.org> |
Btrfs: Check INCOMPAT flags on remount and add helper function In support of the recently added capability to remount with lzo compression, provide a helper function to check the compression INCOMPAT flags when remounting with lzo compression, and set the flags if necessary. Also, implement the new helper function when defragmenting with explicit lzo compression and when setting the default subvolume. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
362a20c5 |
|
01-Aug-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: allow cross-subvolume file clone Lift the EXDEV condition and allow different root trees for files being cloned, then pass source inode's root when searching for extents. Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from one filesystem are mounted separately. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
b9ca0664 |
|
29-Jun-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: do not set subvolume flags in readonly mode $ mkfs.btrfs /dev/sdb7 $ btrfstune -S1 /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs mount: block device /dev/sdb7 is write-protected, mounting read-only $ btrfs dev add /dev/sdb8 /mnt/btrfs/ Now we get a btrfs in which mnt flags has readonly but sb flags does not. So for those ioctls that only check sb flags with MS_RDONLY, it is going to be a problem. Setting subvolume flags is such an ioctl, we should use mnt_want_write_file() to check RO flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
#
e54bfa31 |
|
29-Jun-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: use mnt_want_write_file instead of mnt_want_write mnt_want_write_file is faster when file has been opened for write. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
#
768e9dfe |
|
29-Jun-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: remove redundant r/o check for superblock mnt_want_write() and mnt_want_write_file() will check sb->s_flags with MS_RDONLY, and we don't need to do it ourselves. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
#
a874a63e |
|
29-Jun-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: check write access to mount earlier while creating snapshots Move check of write access to mount into upper functions so that we can use mnt_want_write_file instead, which is faster than mnt_want_write. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
#
b27f7c0c |
|
22-Jun-2012 |
David Sterba <dsterba@suse.cz> |
btrfs: join DEV_STATS ioctls to one Commit c11d2c236cc260b36 (Btrfs: add ioctl to get and reset the device stats) introduced two ioctls doing almost the same thing distinguished by just the ioctl number which encodes "do reset after read". I have suggested http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html to implement it via the ioctl args. This hasn't happen, and I think we should use a more clean way to pass flags and should not waste ioctl numbers. CC: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
a43a2111 |
|
19-Jun-2012 |
Andrew Mahone <andrew.mahone@gmail.com> |
btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased Rebased on btrfs-next and retested. Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip checks for adjacent extents and extent size when deciding whether to defrag, as these can prevent an uncompressed and unfragmented file from being compressed as requested. Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
|
#
11e62a8f |
|
19-Jul-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6f72c7e2 |
|
14-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add qgroup inheritance When creating a subvolume or snapshot, it is necessary to initialize the qgroup account with a copy of some other (tracking) qgroup. This patch adds parameters to the ioctls to pass the information from which qgroup to inherit. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
5d13a37b |
|
14-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: add qgroup ioctls Ioctls to control the qgroup feature like adding and removing qgroups and assigning qgroups. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
a8c4a33b |
|
15-Jun-2012 |
Chris Mason <chris.mason@fusionio.com> |
Btrfs: cast devid to unsigned long long for printk %llu Avoid warning in 32 bit machines Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
4e42ae1b |
|
14-Jun-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: do not resize a seeding device Seeding devices are not supposed to change any more. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
#
6c282eb4 |
|
11-Jun-2012 |
Li Zefan <lizefan@huawei.com> |
Btrfs: fix defrag regression If a file has 3 small extents: | ext1 | ext2 | ext3 | Running "btrfs fi defrag" will only defrag the last two extents, if those extent mappings hasn't been read into memory from disk. This bug was introduced by commit 17ce6ef8d731af5edac8c39e806db4c7e1f6956f ("Btrfs: add a check to decide if we should defrag the range") The cause is, that commit looked into previous and next extents using lookup_extent_mapping() only. While at it, remove the code that checks the previous extent, since it's sufficient to check the next extent. Signed-off-by: Li Zefan <lizefan@huawei.com>
|
#
606686ee |
|
04-Jun-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: use rcu to protect device->name Al pointed out that we can just toss out the old name on a device and add a new one arbitrarily, so anybody who uses device->name in printk could possibly use free'd memory. Instead of adding locking around all of this he suggested doing it with RCU, so I've introduced a struct rcu_string that does just that and have gone through and protected all accesses to device->name that aren't under the uuid_mutex with rcu_read_lock(). This protects us and I will use it for dealing with removing the device that we used to mount the file system in a later patch. Thanks, Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
c11d2c23 |
|
25-May-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: add ioctl to get and reset the device stats An ioctl interface is added to get the device statistic counters. A second ioctl is added to atomically get and reset these counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
9ba1f6e4 |
|
11-May-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: do not do balance in readonly mode In normal cases, we would not be allowed to do balance in RO mode. However, when we're using a seeding device and adding another device to sprout, things will change: $ mkfs.btrfs /dev/sdb7 $ btrfstune -S 1 /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs fi bal /mnt/btrfs -----------------------> fail. $ btrfs dev add /dev/sdb8 /mnt/btrfs $ btrfs fi bal /mnt/btrfs -----------------------> works! It should not be designed as an exception, and we'd better add another check for mnt flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
|
#
a27202fb |
|
26-Apr-2012 |
Jim Meyering <jim@meyering.net> |
Btrfs: NUL-terminate path buffer in DEV_INFO ioctl result A device with name of length BTRFS_DEVICE_PATH_NAME_MAX or longer would not be NUL-terminated in the DEV_INFO ioctl result buffer. Signed-off-by: Jim Meyering <meyering@redhat.com>
|
#
2eec6c81 |
|
25-Apr-2012 |
Daniel J Blueman <daniel@quora.org> |
Fix minor type issues Address some minor type issues identified by sparse checker. Signed-off-by: Daniel J Blueman <daniel@quora.org>
|
#
0c4d2d95 |
|
05-Apr-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: use i_version instead of our own sequence We've been keeping around the inode sequence number in hopes that somebody would use it, but nobody uses it and people actually use i_version which serves the same purpose, so use i_version where we used the incore inode's sequence number and that way the sequence is updated properly across the board, and not just in file write. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
5581a51a |
|
16-May-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: don't set for_cow parameter for tree block functions Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
99ba55ad |
|
19-Mar-2012 |
Stefan Behrens <sbehrens@giantdisaster.de> |
Btrfs: fix btrfs_ioctl_dev_info() crash on missing device When a filesystem is mounted with the degraded option, it is possible that some of the devices are not there. btrfs_ioctl_dev_info() crashs in this case because the device name is a NULL pointer. This ioctl was only used for scrub. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
|
#
e1f041e1 |
|
29-Mar-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: update to the right index of defragment When we use autodefrag, we forget to update the index which indicates the last page we've dirty. And we'll set dirty flags on a same set of pages again and again. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
66c26892 |
|
29-Mar-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: do not bother to defrag an extent if it is a big real extent $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null $ filefrag -v /mnt/btrfs/foobar Filesystem type is: 9123683e File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3072 10 eof /mnt/btrfs/foobar: 1 extent found Now we have a big real extent [0, 40960), but autodefrag will still defrag it. $ sync $ filefrag -v /mnt/btrfs/foobar Filesystem type is: 9123683e File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3082 10 eof /mnt/btrfs/foobar: 1 extent found So if we already find a big real extent, we're ok about that, just skip it. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
17ce6ef8 |
|
29-Mar-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: add a check to decide if we should defrag the range If our file's layout is as follows: | hole | data1 | hole | data2 | we do not need to defrag this file, because this file has holes and cannot be merged into one extent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1f12bd06 |
|
29-Mar-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix the mismatch of page->mapping commit 600a45e1d5e376f679ff9ecc4ce9452710a6d27c (Btrfs: fix deadlock on page lock when doing auto-defragment) fixes the deadlock on page, but it also introduces another bug. A page may have been truncated after unlock & lock. So we need to find it again to get the right one. And since we've held i_mutex lock, inode size remains unchanged and we can drop isize overflow checks. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ecb8bea8 |
|
29-Mar-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: fix race between direct io and autodefrag The bug is from running xfstests 209 with autodefrag. The race is as follows: t1 t2(autodefrag) direct IO invalidate pagecache dio(old data) add_inode_defrag invalidate pagecache endio direct IO invalidate pagecache run_defrag readpage(old data) set page dirty (old data) dio(new data, rewrite) invalidate pagecache (*) endio t2(autodefrag) will get old data into pagecache via readpage and set pagecache dirty. Meanwhile, invalidate pagecache(*) will fail due to dirty flags in pages. So the old data may be flushed into disk by flush thread, which will lead to data loss. And so does the case of user defragment progs. The patch fixes this race by holding i_mutex when we readpage and set page dirty. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7a3ae2f8 |
|
23-Mar-2012 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: fix regression in scrub path resolving In commit 4692cf58 we introduced new backref walking code for btrfs. This assumes we're searching live roots, which requires a transaction context. While scrubbing, however, we must not join a transaction because this could deadlock with the commit path. Additionally, what scrub really wants to do is resolving a logical address in the commit root it's currently checking. This patch adds support for logical to path resolving on commit roots and makes scrub use that. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
79787eaa |
|
12-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: replace many BUG_ONs with proper error handling btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
ce598979 |
|
26-Jul-2011 |
Mark Fasheh <mfasheh@suse.com> |
btrfs: Don't BUG_ON errors from btrfs_create_subvol_root() This is called from only one place - create_subvol() which passes errors safely back out to it's caller, btrfs_mksubvol where they are handled. Additionally, btrfs_create_subvol_root() itself bug's needlessly from error return of btrfs_update_inode(). Since create_subvol() was fixed to catch errors we can bubble this one up too. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
|
#
d0082371 |
|
01-Mar-2012 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: drop gfp_t from lock_extent lock_extent and unlock_extent are always called with GFP_NOFS, drop the argument and use GFP_NOFS consistently. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
|
#
16780cab |
|
20-Feb-2012 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add extra sanity checks on the path names in btrfs_mksubvol Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
600a45e1 |
|
16-Feb-2012 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix deadlock on page lock when doing auto-defragment When I ran xfstests circularly on a auto-defragment btrfs, the deadlock happened. Steps to reproduce: [tty0] # export MOUNT_OPTIONS="-o autodefrag" # export TEST_DEV=<partition1> # export TEST_DIR=<mountpoint1> # export SCRATCH_DEV=<partition2> # export SCRATCH_MNT=<mountpoint2> # while [ 1 ] > do > ./check 091 127 263 > sleep 1 > done [tty1] # while [ 1 ] > do > echo 3 > /proc/sys/vm/drop_caches > done Several hours later, the test processes will hang on, and the deadlock will happen on page lock. The reason is that: Auto defrag task Flush thread Test task btrfs_writepages() add ordered extent (including page 1, 2) set page 1 writeback set page 2 writeback endio_fn() end page 2 writeback release page 2 lock page 1 alloc and lock page 2 page 2 is not uptodate btrfs_readpage() start ordered extent() btrfs_writepages() try to lock page 1 so deadlock happens. Fix this bug by unlocking the page which is in writeback, and re-locking it after the writeback end. Signed-off-by: Miao Xie <miax@cn.fujitsu.com>
|
#
7ec31b54 |
|
26-Jan-2012 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: do not defrag a file partially xfstests 218 complains that btrfs defrags a file partially: After: 1 Write backwards sync, but contiguous - should defrag to 1 extent Before: 10 -After: 1 +After: 2 To fix this, we need to set max_to_defrag count properly. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f248679e |
|
12-Jan-2012 |
Josef Bacik <josef@redhat.com> |
Btrfs: add a delalloc mutex to inodes for delalloc reservations I was using i_mutex for this, but we're getting bogus lockdep warnings by doing that and theres no real way to get rid of those, so just stop using i_mutex to protect delalloc metadata reservations and use a delalloc mutex instead. This shouldn't be contended often at all, only if you are writing and mmap writing to the file at the same time. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
19a39dce |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add balance progress reporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
de322263 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for resuming restriper after it was paused Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the same heuristics used when recovering balance after a crash to try to start where we left off last time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
a7e99c69 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for canceling restriper Implement an ioctl for canceling restriper. Currently we wait until relocation of the current block group is finished, in future this can be done by triggering a commit. Balance item is deleted and no memory about the interrupted balance is kept. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
837d5b6e |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: allow for pausing restriper Implement an ioctl for pausing restriper. This pauses the relocation, but balance is still considered to be "in progress": balance item is not deleted, other volume operations cannot be started, etc. If paused in the middle of profile changing operation we will continue making allocations with the target profile. Add a hook to close_ctree() to pause restriper and free its data structures on unmount. (It's safe to unmount when restriper is in "paused" state, we will resume with the same parameters on the next mount) Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
f43ffb60 |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add basic infrastructure for selective balancing This allows to have a separate set of filters for each chunk type (data,meta,sys). The code however is generic and switch on chunk type is only done once. This commit also adds a type filter: it allows to balance for example meta and system chunks w/o touching data ones. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
c9e9f97b |
|
16-Jan-2012 |
Ilya Dryomov <idryomov@gmail.com> |
Btrfs: add basic restriper infrastructure Add basic restriper infrastructure: extended balancing ioctl and all related ioctl data structures, add data structure for tracking restriper's state to fs_info, etc. The semantics of the old balancing ioctl are fully preserved. Explicitly disallow any volume operations when balance is in progress. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
#
4da6f1a3 |
|
28-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: reserve metadata space in btrfs_ioctl_setflags() Check and reserve space for btrfs_update_inode(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
f062abf0 |
|
28-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: remove BUG_ON()s in btrfs_ioctl_setflags() We can recover from errors and return -errno to user space. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
815745cf |
|
17-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
btrfs: let ->s_fs_info point to fs_info, not root... the latter can be obtained from the former (by looking as ->tree_root) just as cheaply as we currently are doing the other way round. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
4692cf58 |
|
02-Dec-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
Btrfs: new backref walking code The old backref iteration code could only safely be used on commit roots. Besides this limitation, it had bugs in finding the roots for these references. This commit replaces large parts of it by btrfs_find_all_roots() which a) really finds all roots and the correct roots, b) works correctly under heavy file system load, c) considers delayed refs. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
2a79f17e |
|
09-Dec-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: mnt_drop_write_file() new helper (wrapper around mnt_drop_write()) to be used in pair with mnt_want_write_file(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a561be71 |
|
23-Nov-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
switch a bunch of places to mnt_want_write_file() it's both faster (in case when file has been opened for write) and cleaner. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
66d7e7f0 |
|
12-Sep-2011 |
Arne Jansen <sensille@gmx.net> |
Btrfs: mark delayed refs as for cow Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
660d3f6c |
|
09-Dec-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix how we do delalloc reservations and how we free reservations on error Running xfstests 269 with some tracing my scripts kept spitting out errors about releasing bytes that we didn't actually have reserved. This took me down a huge rabbit hole and it turns out the way we deal with reserved_extents is wrong, we need to only be setting it if the reservation succeeds, otherwise the free() method will come in and unreserve space that isn't actually reserved yet, which can lead to other warnings and such. The math was all working out right in the end, but it caused all sorts of other issues in addition to making my scripts yell and scream and generally make it impossible for me to track down the original issue I was looking for. The other problem is with our error handling in the reservation code. There are two cases that we need to deal with 1) We raced with free. In this case free won't free anything because csum_bytes is modified before we dro the lock in our reservation path, so free rightly doesn't release any space because the reservation code may be depending on that reservation. However if we fail, we need the reservation side to do the free at that point since that space is no longer in use. So as it stands the code was doing this fine and it worked out, except in case #2 2) We don't race with free. Nobody comes in and changes anything, and our reservation fails. In this case we didn't reserve anything anyway and we just need to clean up csum_bytes but not free anything. So we keep track of csum_bytes before we drop the lock and if it hasn't changed we know we can just decrement csum_bytes and carry on. Because of the case where we can race with free()'s since we have to drop our spin_lock to do the reservation, I'm going to serialize all reservations with the i_mutex. We already get this for free in the heavy use paths, truncate and file write all hold the i_mutex, just needed to add it to page_mkwrite and various ioctl/balance things. With this patch my space leak scripts no longer scream bloody murder. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
306424cc |
|
14-Dec-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix ctime update of on-disk inode To reproduce the bug: # touch /mnt/tmp # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:23.412105981 +0800 # chattr +i /mnt/tmp # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:43.198105295 +0800 # umount /mnt # mount /dev/loop1 /mnt # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:23.412105981 +0800 We should update ctime of in-memory inode before calling btrfs_update_inode(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ece7d20e |
|
18-Nov-2011 |
Mike Fleetwood <mike.fleetwood@googlemail.com> |
Btrfs: Don't error on resizing FS to same size It seems overly harsh to fail a resize of a btrfs file system to the same size when a shrink or grow would succeed. User app GParted trips over this error. Allow it by bypassing the shrink or grow operation. Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
|
#
5bb14682 |
|
20-Nov-2011 |
Arnd Hannemann <arnd@arndnet.de> |
Btrfs: prefix resize related printks with btrfs: For the user it is confusing to find something like: [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472 in kernel log, because it doesn't point directly to btrfs. This patch prefixes those messages with "btrfs:" like other btrfs related printks. Signed-off-by: Arnd Hannemann <arnd@arndnet.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
745c4d8e |
|
20-Nov-2011 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: Fix up 32/64-bit compatibility for new ioctls This patch casts to unsigned long before casting to a pointer and fixes the following warnings: fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
740c3d22 |
|
02-Nov-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix the new inspection ioctls for 32 bit compat The new ioctls to follow backrefs are not clean for 32/64 bit compat. This reworks them for u64s everywhere. They are brand new, so there are no problems with changing the interface now. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6c41761f |
|
13-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: separate superblock items out of fs_info fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
f4c697e6 |
|
05-Sep-2011 |
Lukas Czerner <lczerner@redhat.com> |
btrfs: return EINVAL if start > total_bytes in fitrim ioctl We should retirn EINVAL if the start is beyond the end of the file system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate check for it. Also in the btrfs_trim_fs() it is possible that len+start might overflow if big values are passed. Fix it by decrementing the len so that start+len is equal to the file system size in the worst case. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
#
008873ea |
|
02-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: honor extent thresh during defragmentation We won't defrag an extent, if it's bigger than the threshold we specified and there's no small extent before it, but actually the code doesn't work this way. There are three bugs: - When should_defrag_range() decides we should keep on defragmenting an extent, last_len is not incremented. (old bug) - The length that passes to should_defrag_range() is not the length we're going to defrag. (new bug) - We always defrag 256K bytes data, and a big extent can be part of this range. (new bug) For a file with 4 extents: | 4K | 4K | 256K | 256K | The result of defrag with (the default) 256K extent thresh should be: | 264K | 256K | but with those bugs, we'll get: | 520K | Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
5ca49660 |
|
02-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix wrong max_to_defrag in btrfs_defrag_file() It's off-by-one, and thus we may skip the last page while defragmenting. An example case: # create /mnt/file with 2 4K file extents # btrfs fi defrag /mnt/file # sync # filefrag /mnt/file /mnt/file: 2 extents found So it's not defragmented. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
151a31b2 |
|
02-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: use i_size_read() in btrfs_defrag_file() Don't use inode->i_size directly, since we're not holding i_mutex. This also fixes another bug, that i_size can change after it's checked against 0 and then (i_size - 1) can be negative. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
cbcc8326 |
|
02-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix defragmentation regression There's an off-by-one bug: # create a file with lots of 4K file extents # btrfs fi defrag /mnt/file # sync # filefrag -v /mnt/file Filesystem type is: 9123683e File size of /mnt/file is 1228800 (300 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3372 64 1 64 3136 3435 1 2 65 3436 3136 64 3 129 3201 3499 1 4 130 3500 3201 64 5 194 3266 3563 1 6 195 3564 3266 64 7 259 3331 3627 1 8 260 3628 3331 40 eof After this patch: ... # filefrag -v /mnt/file Filesystem type is: 9123683e File size of /mnt/file is 1228800 (300 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3372 300 eof /mnt/file: 1 extent found Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
60ccf82f |
|
01-Sep-2011 |
Diego Calleja <diegocg@gmail.com> |
btrfs: fix memory leak in btrfs_defrag_file kmemleak found this: unreferenced object 0xffff8801b64af968 (size 512): comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s) hex dump (first 32 bytes): 00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff ................ 80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff ................ backtrace: [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0 [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240 [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20 [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210 [<ffffffff812479d7>] cleaner_kthread+0x177/0x190 [<ffffffff81075c7d>] kthread+0x8d/0xa0 [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10 [<ffffffffffffffff>] 0xffffffffffffffff "pages" is not always freed. Fix it removing the unnecesary additional return. Signed-off-by: Diego Calleja <diegocg@gmail.com>
|
#
e27425d6 |
|
27-Sep-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: only inherit btrfs specific flags when creating files Xfstests 79 was failing because we were inheriting the S_APPEND flag when we weren't supposed to. There isn't any specific documentation on this so I'm taking the test as the standard of how things work, and having S_APPEND set on a directory doesn't mean that S_APPEND gets inherited by its children according to this test. So only inherit btrfs specific things. This will let us set compress/nocompress on specific directories and everything in the directories will inherit this flag, same with nodatacow. With this patch test 79 passes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
3b16a4e3 |
|
21-Sep-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use the inode's mapping mask for allocating pages Johannes pointed out we were allocating only kernel pages for doing writes, which is kind of a big deal if you are on 32bit and have more than a gig of ram. So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we don't re-enter. Thanks, Reported-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
f7f43cc8 |
|
11-Oct-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: make sure not to defrag extents past i_size The btrfs file defrag code will loop through the extents and force COW on them. But there is a concurrent truncate in the middle of the defrag, it might end up defragging the same range over and over again. The problem is that writepage won't go through and do anything on pages past i_size, so the cow won't happen, so the file will appear to still be fragmented. defrag will end up hitting the same extents again and again. In the worst case, the truncate can actually live lock with the defrag because the defrag keeps creating new ordered extents which the truncate code keeps waiting on. The fix here is to make defrag check for i_size inside the main loop, instead of just once before the looping starts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2a0f7f57 |
|
10-Oct-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix recursive auto-defrag Follow those steps: # mount -o autodefrag /dev/sda7 /mnt # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1 # sync # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc and then it'll go into a loop: writeback -> defrag -> writeback ... It's because writeback writes [8K, 200K] and then writes [0, 8K]. I tried to make writeback know if the pages are dirtied by defrag, but the patch was a bit intrusive. Here I simply set writeback_index when we defrag a file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d7728c96 |
|
07-Jul-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
btrfs: new ioctls to do logical->inode and inode->path resolving these ioctls make use of the new functions initially added for scrub. they return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and all paths belonging to an inode (BTRFS_IOC_INO_PATHS). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
|
#
b6f3409b |
|
20-Sep-2011 |
Sage Weil <sage@newdream.net> |
Btrfs: reserve sufficient space for ioctl clone Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We need to reserve space for: - adjusting the old extent (possibly splitting it) - adding the new extent - updating the inode Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dde820fb |
|
18-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: don't change inode flag of the dest clone file The dst file will have the same inode flags with dst file after file clone, and I think it's unexpected. For example, the dst file will suddenly become immutable after getting some share of data with src file, if the src is immutable. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0e7b824c |
|
18-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: don't make a file partly checksummed through file clone To reproduce the bug: # mount /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/src bs=4K count=1 # umount /mnt # mount -o nodatasum /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/dst bs=4K count=1 # clone_range -s 4K -l 4K /mnt/src /mnt/dst # echo 3 > /proc/sys/vm/drop_caches # cat /mnt/dst # dmesg ... btrfs no csum found for inode 258 start 0 btrfs csum failed ino 258 off 0 csum 2566472073 private 0 It's because part of the file is checksummed and the other part is not, and then btrfs will complain checksum is not found when we read the file. Disallow file clone if src and dst file have different checksum flag, so we ensure a file is completely checksummed or unchecksummed. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
71ef0786 |
|
18-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix pages truncation in btrfs_ioctl_clone() It's a bug in commit f81c9cdc567cd3160ff9e64868d9a1a7ee226480 (Btrfs: truncate pages from clone ioctl target range) We should pass the dest range to the truncate function, but not the src range. Also move the function before locking extent state. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d525e8ab |
|
11-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: add dummy extent if dst offset excceeds file end in You can see there's no file extent with range [0, 4096]. Check this by btrfsck: # btrfsck /dev/sda7 root 5 inode 258 errors 100 ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d72c0842 |
|
11-Sep-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: calc file extent num_bytes correctly in file clone num_bytes should be 4096 not 12288. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f81c9cdc |
|
10-Aug-2011 |
Sage Weil <sage@newdream.net> |
Btrfs: truncate pages from clone ioctl target range We need to truncate page cache pages for the clone ioctl target range or else we'll confuse ourselves to no end. If the old data was cached, we used to still see it (until remount). If the page was partially updated we used to get a mix of old and new data. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
77906a50 |
|
13-Jul-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: copy string correctly in INO_LOOKUP ioctl Memory areas [ptr, ptr+total_len] and [name, name+total_len] may overlap, so it's wrong to use memcpy(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9e0baf60 |
|
15-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix enospc problems with delalloc So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a94733d0 |
|
11-Jul-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: use find_or_create_page instead of grab_cache_page grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we need GFP_NOFS so we don't deadlock. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
2fbe8c8a |
|
16-Jul-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
get rid of useless dget_parent() in fs/btrfs/ioctl.c both callers there have dentry->d_parent stabilized by the fact that their caller had obtained dentry from lookup_one_len() and had not dropped ->i_mutex on parent since then. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8351583e |
|
14-Jun-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: protect the pending_snapshots list with trans_lock Currently there is nothing protecting the pending_snapshots list on the transaction. We only hold the directory mutex that we are snapshotting and a read lock on the subvol_sem, so we could race with somebody else creating a snapshot in a different directory and end up with list corruption. So protect this list with the trans_lock. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
027ed2f0 |
|
08-Jun-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info() The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so don't declare the variable on stack. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a4689d2b |
|
31-May-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: use btrfs_ino to access inode number commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode number directly while it should use the helper with the new inode number allocator. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4cb5300b |
|
24-May-2011 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add mount -o auto_defrag This will detect small random writes into files and queue the up for an auto defrag process. It isn't well suited to database workloads yet, but works for smaller files such as rpm, sqlite or bdb databases. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1f78160c |
|
20-Apr-2011 |
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> |
Btrfs: using rcu lock in the reader side of devices list fs_devices->devices is only updated on remove and add device paths, so we can use rcu to protect it in the reader side Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e2156867 |
|
14-May-2011 |
Hugo Mills <hugo@carfax.org.uk> |
btrfs: Ensure the tree search ioctl returns the right number of records Btrfs's tree search ioctl has a field to indicate that no more than a given number of records should be returned. The ioctl doesn't honour this, as the tested value is not incremented until the end of the copy_to_sk function. This patch removes an unnecessary local variable, and updates the num_found counter as each key is found in the tree. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d82a6f1d |
|
11-May-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill BTRFS_I(inode)->block_group Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
a4abeea4 |
|
11-Apr-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: kill trans_mutex We use trans_mutex for lots of things, here's a basic list 1) To serialize trans_handles joining the currently running transaction 2) To make sure that no new trans handles are started while we are committing 3) To protect the dead_roots list and the transaction lists Really the serializing trans_handles joining is not too hard, and can really get bogged down in acquiring a reference to the transaction. So replace the trans_mutex with a trans_lock spinlock and use it to do the following 1) Protect fs_info->running_transaction. All trans handles have to do is check this, and then take a reference of the transaction and keep on going. 2) Protect the fs_info->trans_list. This doesn't get used too much, basically it just holds the current transactions, which will usually just be the currently committing transaction and the currently running transaction at most. 3) Protect the dead roots list. This is only ever processed by splicing the list so this is relatively simple. 4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using the trans_mutex before, so this is a pretty straightforward change. 5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over the entirety of the commit we need to have a way to block new people from creating a new transaction while we're doing our work. So we set no_trans_join and in join_transaction we test to see if that is set, and if it is we do a wait_on_commit. 6) Make the transaction use count atomic so we don't need to take locks to modify it when we're dropping references. 7) Add a commit_lock to the transaction to make sure multiple people trying to commit the same transaction don't race and commit at the same time. 8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl trans. I have tested this with xfstests, but obviously it is a pretty hairy change so lots of testing is greatly appreciated. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
7a7eaa40 |
|
12-Apr-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: take away the num_items argument from btrfs_join_transaction I keep forgetting that btrfs_join_transaction() just ignores the num_items argument, which leads me to sending pointless patches and looking stupid :). So just kill the num_items argument from btrfs_join_transaction and btrfs_start_ioctl_transaction, since neither of them use it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
16cdcec7 |
|
22-Apr-2011 |
Miao Xie <miaox@cn.fujitsu.com> |
btrfs: implement delayed inode items operation Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ebcb904d |
|
14-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix FS_IOC_SETFLAGS ioctl Steps to reproduce the bug: - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL - Call FS_IOC_SETFLAGS ioctl with flags=0 - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d0092bdd |
|
14-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: fix FS_IOC_GETFLAGS ioctl As we've added per file compression/cow support. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e1e8fb6a |
|
14-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
fs: remove FS_COW_FL FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file COW in btrfs, but FS_NOCOW_FL is sufficient. The fact is we don't have corresponding BTRFS_INODE_COW flag. COW is default, and FS_NOCOW_FL can be used to switch off COW for a single file. If we mount btrfs with nodatacow, a newly created file will be set with the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the FS_NOCOW_FL flag. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8628764e |
|
23-Mar-2011 |
Arne Jansen <sensille@gmx.net> |
btrfs: add readonly flag setting the readonly flag prevents writes in case an error is detected Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
475f6387 |
|
11-Mar-2011 |
Jan Schmidt <list.btrfs@jan-o-sch.net> |
btrfs: new ioctls for scrub adds ioctls necessary to start and cancel scrubs, to get current progress and to get info about devices to be scrubbed. Note that the scrub is done per-device and that the ioctl only returns after the scrub for this devices is finished or has been canceled. Signed-off-by: Arne Jansen <sensille@gmx.net>
|
#
b3b4aa74 |
|
20-Apr-2011 |
David Sterba <dsterba@suse.cz> |
btrfs: drop unused parameter from btrfs_release_path parameter tree root it's not used since commit 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer interface for large blocksizes") Signed-off-by: David Sterba <dsterba@suse.cz>
|
#
33345d01 |
|
19-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Always use 64bit inode number There's a potential problem in 32bit system when we exhaust 32bit inode numbers and start to allocate big inode numbers, because btrfs uses inode->i_ino in many places. So here we always use BTRFS_I(inode)->location.objectid, which is an u64 variable. There are 2 exceptions that BTRFS_I(inode)->location.objectid != inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2), and inode->i_ino will be used in those cases. Another reason to make this change is I'm going to use a special inode to save free ino cache, and the inode number must be > (u64)-256. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
581bb050 |
|
19-Apr-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Cache free inode numbers in memory Currently btrfs stores the highest objectid of the fs tree, and it always returns (highest+1) inode number when we create a file, so inode numbers won't be reclaimed when we delete files, so we'll run out of inode numbers as we keep create/delete files in 32bits machines. This fixes it, and it works similarly to how we cache free space in block cgroups. We start a kernel thread to read the file tree. By scanning inode items, we know which chunks of inode numbers are free, and we cache them in an rb-tree. Because we are searching the commit root, we have to carefully handle the cross-transaction case. The rb-tree is a hybrid extent+bitmap tree, so if we have too many small chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram of extents, and a bitmap will be used if we exceed this threshold. The extents threshold is adjusted in runtime. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
13f2696f |
|
11-Apr-2011 |
Daniel J Blueman <daniel.blueman@gmail.com> |
fix user annotation in ioctl.c Fix address space annotation correct in ioctl.c. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> BTRFS_BLOCK_GROUP_SYSTEM, @@ -2387,7 +2387,7 @@ long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) up_read(&info->groups_sem); } - user_dest = (struct btrfs_ioctl_space_info *) + user_dest = (struct btrfs_ioctl_space_info __user *) (arg + sizeof(struct btrfs_ioctl_space_args)); if (copy_to_user(user_dest, dest_orig, alloc_size)) Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
08fe4db1 |
|
27-Mar-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Fix uninitialized root flags for subvolumes root_item->flags and root_item->byte_limit are not initialized when a subvolume is created. This bug is not revealed until we added readonly snapshot support - now you mount a btrfs filesystem and you may find the subvolumes in it are readonly. To work around this problem, we steal a bit from root_item->inode_item->flags, and use it to indicate if those fields have been properly initialized. When we read a tree root from disk, we check if the bit is set, and if not we'll set the flag and initialize the two fields of the root item. Reported-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Andreas Philipp <philipp.andreas@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8b2b2d3c |
|
03-Apr-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: fix memory leak in btrfs_ioctl_start_sync() Call btrfs_end_transaction() if btrfs_commit_transaction_async() fails. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2d4e6f6ad |
|
24-Feb-2011 |
liubo <liubo2009@cn.fujitsu.com> |
Btrfs: fix return value of setflags ioctl setflags ioctl should return error when any checks fail. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f7039b1d |
|
24-Mar-2011 |
Li Dongyang <lidongyang@novell.com> |
Btrfs: add btrfs_trim_fs() to handle FITRIM We take an free extent out from allocator, trim it, then put it back, but before we trim the block group, we should make sure the block group is cached, so plus a little change to make cache_block_group() run without a transaction. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
75e7cb7f |
|
22-Mar-2011 |
Liu Bo <liubo2009@cn.fujitsu.com> |
Btrfs: Per file/directory controls for COW and compression Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. According to Chris's comment, there should be just one true compression method(probably LZO) stored in the super. However, before this, we would wait for that one method is stable enough to be adopted into the super. So I list it as a long term goal, and just store it in ram today. After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to control file and directory's datacow and compression attribute. NOTE: - The compression type is selected by such rules: If we mount btrfs with compress options, ie, zlib/lzo, the type is it. Otherwise, we'll use the default compress type (zlib today). v1->v2: - rebase to the latest btrfs. v2->v3: - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW will be screwed by inheritance from parent directory. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
db5b493a |
|
23-Mar-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
Btrfs: cleanup some BUG_ON() This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2e149670 |
|
23-Mar-2011 |
Serge E. Hallyn <serge@hallyn.com> |
userns: rename is_owner_or_cap to inode_owner_or_capable And give it a kernel-doc comment. [akpm@linux-foundation.org: btrfs changed in linux-next] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
66b4ffd1 |
|
31-Jan-2011 |
Josef Bacik <josef@redhat.com> |
Btrfs: handle errors in btrfs_orphan_cleanup If we cannot truncate an inode for some reason we will never delete the orphan item associated with that inode, which means that we will loop forever in btrfs_orphan_cleanup. Instead of doing this just return error so we fail to mount. It sucks, but hey it's better than hanging. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
b4dc2b8c |
|
15-Feb-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl - Check user-specified flags correctly - Check the inode owership - Search root item in root tree but not fs tree Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
51788b1b |
|
14-Feb-2011 |
Dan Rosenberg <drosenberg@vsecurity.com> |
btrfs: prevent heap corruption in btrfs_ioctl_space_info() Commit bf5fc093c5b625e4259203f1cee7ca73488a5620 refactored btrfs_ioctl_space_info() and introduced several security issues. space_args.space_slots is an unsigned 64-bit type controlled by a possibly unprivileged caller. The comparison as a signed int type allows providing values that are treated as negative and cause the subsequent allocation size calculation to wrap, or be truncated to 0. By providing a size that's truncated to 0, kmalloc() will return ZERO_SIZE_PTR. It's also possible to provide a value smaller than the slot count. The subsequent loop ignores the allocation size when copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR. The fix changes the slot count type and comparison typecast to u64, which prevents truncation or signedness errors, and also ensures that we don't copy more data than we've allocated in the subsequent loop. Note that zero-size allocations are no longer possible since there is already an explicit check for space_args.space_slots being 0 and truncation of this value is no longer an issue. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Josef Bacik <josef@redhat.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
98d5dc13 |
|
19-Jan-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
btrfs: fix return value check of btrfs_start_transaction() The error check of btrfs_start_transaction() is added, and the mistake of the error check on several places is corrected. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
abd30bb0 |
|
23-Jan-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
btrfs: check return value of btrfs_start_ioctl_transaction() properly btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL. So, it is necessary to use IS_ERR() to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3612b495 |
|
24-Jan-2011 |
Tsutomu Itoh <t-itoh@jp.fujitsu.com> |
btrfs: fix return value check of btrfs_join_transaction() The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock() is added, and the mistake of the error check in several places is corrected. For more stable Btrfs, I think that we should reduce BUG_ON(). But, I think that long time is necessary for this. So, I propose this patch as a short-term solution. With this patch: - To more stable Btrfs, the part that should be corrected is clarified. - The panic isn't done by the NULL pointer reference etc. (even if BUG_ON() is increased temporarily) - The error code is returned in the place where the error can be easily returned. As a long-term plan: - BUG_ON() is reduced by using the forced-readonly framework, etc. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4d728ec7 |
|
25-Jan-2011 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Fix file clone when source offset is not 0 Suppose: - the source extent is: [0, 100] - the src offset is 10 - the clone length is 90 - the dest offset is 0 This statement: new_key.offset = key.offset + destoff - off will produce such an extent for the dest file: [ino, BTRFS_EXTENT_DATA_KEY, -10] , which is obviously wrong. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
0caa102d |
|
20-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls This allows us to set a snapshot or a subvolume readonly or writable on the fly. Usage: Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS); Changelog for v3: - Change to pass __u64 as ioctl parameter. Changelog for v2: - Add _GETFLAGS ioctl. - Check if the passed fd is the root of a subvolume. - Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
b83cc969 |
|
20-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Add readonly snapshots support Usage: Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call ioctl(BTRFS_I0CTL_SNAP_CREATE_V2). Implementation: - Set readonly bit of btrfs_root_item->flags. - Add readonly checks in btrfs_permission (inode_permission), btrfs_setattr, btrfs_set/remove_xattr and some ioctls. Changelog for v3: - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags. - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
fa0d2b9b |
|
20-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Refactor btrfs_ioctl_snap_create() Split it into two functions for two different ioctls, since they share no common code. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
1a419d85 |
|
25-Oct-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Allow to specify compress method when defrag Update defrag ioctl, so one can choose lzo or zlib when turning on compression in defrag operation. Changelog: v1 -> v2 - Add incompability flag. - Fix to check invalid compress type. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
261507a0 |
|
16-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Allow to add new compression algorithm Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
#
fdfb1e4f |
|
09-Dec-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
Btrfs: Make async snapshot ioctl more generic If we had reserved some bytes in struct btrfs_ioctl_vol_args, we wouldn't have to create a new structure for async snapshot creation. Here we convert async snapshot ioctl to use a more generic ABI, as we'll add more ioctls for snapshots/subvolumes in the future, readonly snapshots for example. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
75eaa0e2 |
|
09-Dec-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix sync subvol/snapshot creation We were incorrectly taking the async path even for the sync ioctls by passing in &transid unconditionally. There's ample room for further cleanup here, but this keeps the fix simple. Signed-off-by: Sage Weil <sage@newdream.net> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6a912213 |
|
20-Nov-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: use dget_parent where we can UPDATED There are lots of places where we do dentry->d_parent->d_inode without holding the dentry->d_lock. This could cause problems with rename. So instead we need to use dget_parent() and hold the reference to the parent as long as we are going to use it's inode and then dput it at the end. Signed-off-by: Josef Bacik <josef@redhat.com> Cc: raven@themaw.net Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5f3888ff |
|
18-Nov-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Set file size correctly in file clone Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the original filesize of the dest file 'file2' is 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Now clone file1 to file2, the dest file should be 40K, but it still shows 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2a6b8dae |
|
18-Nov-2010 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: Check if dest_offset is block-size aligned before cloning file We've done the check for src_offset and src_length, and We should also check dest_offset, otherwise we'll corrupt the destination file: (After cloning file1 to file2 with unaligned dest_offset) # cat /mnt/file2 cat: /mnt/file2: Input/output error Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4260f7c7 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). We additionally require that the user has write+exec permission on the subvol root inode. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
531cb13f |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: make SNAP_DESTROY async There is no reason to force an immediate commit when deleting a snapshot. Users have some expectation that space from a deleted snapshot be freed immediately, but even if we do commit the reclaim is a background process. If users _do_ want the deletion to be durable, they can call 'sync'. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
72fd032e |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: add SNAP_CREATE_ASYNC ioctl Create a snap without waiting for it to commit to disk. The ioctl is ordered such that subsequent operations will not be contained by the created snapshot, and the commit is initiated, but the ioctl does not wait for the snapshot to commit to disk. We return the specific transid to userspace so that an application can wait for this specific snapshot creation to commit via the WAIT_SYNC ioctl. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
46204592 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: add START_SYNC, WAIT_SYNC ioctls START_SYNC will start a sync/commit, but not wait for it to complete. Any modification started after the ioctl returns is guaranteed not to be included in the commit. If a non-NULL pointer is passed, the transaction id will be returned to userspace. WAIT_SYNC will wait for any in-progress commit to complete. If a transaction id is specified, the ioctl will block and then return (success) when the specified transaction has committed. If it has already committed when we call the ioctl, it returns immediately. If the specified transaction doesn't exist, it returns EINVAL. If no transaction id is specified, WAIT_SYNC will wait for the currently committing transaction to finish it's commit to disk. If there is no currently committing transaction, it returns success. These ioctls are useful for applications which want to impose an ordering on when fs modifications reach disk, but do not want to wait for the full (slow) commit process to do so. Picky callers can take the transid returned by START_SYNC and feed it to WAIT_SYNC, and be certain to wait only as long as necessary for the transaction _they_ started to reach disk. Sloppy callers can START_SYNC and WAIT_SYNC without a transid, and provided they didn't wait too long between the calls, they will get the same result. However, if a second commit starts before they call WAIT_SYNC, they may end up waiting longer for it to commit as well. Even so, a START_SYNC+WAIT_SYNC still guarantees that any operation completed before the START_SYNC reaches disk. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
fccdae43 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix lockdep warning on clone ioctl I'm no lockdep expert, but this appears to make the lockdep warning go away for the i_mutex locking in the clone ioctl. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
050006a7 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix clone ioctl where range is adjacent to extent We had an edge case issue where the requested range was just following an existing extent. Instead of skipping to the next extent, we used the previous one which lead to having zero sized extents. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9a019196 |
|
29-Oct-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix delalloc checks in clone ioctl The lookup_first_ordered_extent() was done on the wrong inode, and the ->delalloc_bytes test was wrong, as the following btrfs_wait_ordered_range() would only invoke a range write and wouldn't write the entire file data range. Also, a bad parameter was passed to btrfs_wait_ordered_range(). Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
559af821 |
|
29-Oct-2010 |
Andi Kleen <andi@firstfloor.org> |
Btrfs: cleanup warnings from gcc 4.6 (nonbugs) These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2354d08f |
|
29-Oct-2010 |
Julia Lawall <julia@diku.dk> |
Btrfs: use memdup_user helpers Use memdup_user when user data is immediately copied into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to,size,flag; position p; identifier l1,l2; @@ - to = \(kmalloc@p\|kzalloc@p\)(size,flag); + to = memdup_user(from,size); if ( - to==NULL + IS_ERR(to) || ...) { <+... when != goto l1; - -ENOMEM + PTR_ERR(to) ...+> } - if (copy_from_user(to, from, size) != 0) { - <+... when != goto l2; - -EFAULT - ...+> - } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
bf5fc093 |
|
29-Sep-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: fix the df ioctl to report raid types The new ENOSPC stuff broke the df ioctl since we no longer create seperate space info's for each RAID type. So instead, loop through each space info's raid lists so we can get the right RAID information which will allow the df ioctl to tell us RAID types again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
|
#
2ebc3464 |
|
19-Jul-2010 |
Dan Rosenberg <dan.j.rosenberg@gmail.com> |
Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE 1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check whether the donor file is append-only before writing to it. 2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer overflow that allows a user to specify an out-of-bounds range to copy from the source file (if off + len wraps around). I haven't been able to successfully exploit this, but I'd imagine that a clever attacker could use this to read things he shouldn't. Even if it's not exploitable, it couldn't hurt to be safe. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b5384d48 |
|
12-Jun-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix CLONE ioctl destination file size expansion to block boundary The CLONE and CLONE_RANGE ioctls round up the range of extents being cloned to the block size when the range to clone extends to the end of file (this is always the case with CLONE). It was then using that offset when extending the destination file's i_size. Fix this by not setting i_size beyond the originally requested ending offset. This bug was introduced by a22285a6 (2.6.35-rc1). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cf1e99a4 |
|
29-May-2010 |
Dan Carpenter <error27@gmail.com> |
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR btrfs_lookup_dir_item() can return either ERR_PTRs or null. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d327099a |
|
29-May-2010 |
Dan Carpenter <error27@gmail.com> |
Btrfs: unwind after btrfs_start_transaction() errors This was added by a22285a6a3: "Btrfs: Integrate metadata reservation with start_transaction". If we goto out here then we skip all the unwinding and there are locks still held etc. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d68fc57b |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Metadata reservation for orphan inodes reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8929ecfa |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Introduce global metadata reservation Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0ca1f7ce |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Update metadata reservation for delayed allocation Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a22285a6 |
|
16-May-2010 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Integrate metadata reservation with start_transaction Besides simplify the code, this change makes sure all metadata reservation for normal metadata operations are released after committing transaction. Changes since V1: Add code that check if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5dc64164 |
|
15-May-2010 |
Dan Rosenberg <dan.j.rosenberg@gmail.com> |
Btrfs: check for read permission on src file in the clone ioctl The existing code would have allowed you to clone a file that was only open for writing Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cf8bfbf |
|
20-Mar-2010 |
Dan Carpenter <error27@gmail.com> |
Btrfs: check btrfs_get_extent return for IS_ERR() btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR() Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
c2b96929 |
|
20-Mar-2010 |
Dan Carpenter <error27@gmail.com> |
Btrfs: handle kmalloc() failure in inode lookup ioctl Return -ENOMEM if kmalloc() fails. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
683be16e |
|
20-Mar-2010 |
Dan Carpenter <error27@gmail.com> |
Btrfs: dereferencing freed memory The original code dereferenced range on the next line. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2f3014fc |
|
25-Mar-2010 |
Andrea Gelmini <andrea.gelmini@gelma.net> |
Btrfs: remove duplicate include in ioctl.c fs/btrfs/ioctl.c: ctree.h is included more than once. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
8ad6fcab |
|
17-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree This is used by the inode lookup ioctl to follow all the backrefs up to the subvol root. But the search being done would sometimes land one past the last item in the leaf instead of finding the backref. This changes the search to look for the highest possible backref and hop back one item. It also fixes a leaked path on failure to find the root. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1b53ac4d |
|
17-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: allow treeid==0 in the inode lookup ioctl When a root id of 0 is sent to the inode lookup ioctl, it will use the root of the file we're ioctling and pass the root id back to userland along with the results. This allows userland to do searches based on that root later on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
90fdde14 |
|
17-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: return keys for large items to the search ioctl The search ioctl was skipping large items entirely (ones that are too big for the results buffer). This changes things to at least copy the item header so that we can send information about the item back to userland. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
abc6e134 |
|
17-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix key checks and advance in the search ioctl The search ioctl was working well for finding tree roots, but using it for generic searches requires a few changes to how the keys are advanced. This treats the search control min fields for objectid, type and offset more like a key, where we drop the offset to zero once we bump the type, etc. The downside of this is that we are changing the min_type and min_offset fields during the search, and so the ioctl caller needs extra checks to make sure the keys in the result are the ones it wanted. This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make things more readable. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7fde62bf |
|
16-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: buffer results in the space_info ioctl The space_info ioctl was using copy_to_user inside rcu_read_lock. This commit changes things to copy into a buffer first and then dump the result down to userland. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
854d2c35 |
|
15-Mar-2010 |
Sage Weil <sage@newdream.net> |
Btrfs: fix search_ioctl key advance key->type is u8, not u64. fs/btrfs/ioctl.c: In function 'copy_to_sk': fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
91748467 |
|
28-Feb-2010 |
Akinobu Mita <akinobu.mita@gmail.com> |
btrfs: use memparse Use memparse() instead of its own private implementation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1406e432 |
|
13-Jan-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: add a "df" ioctl for btrfs df is a very loaded question in btrfs. This gives us a way to get the per-space usage information so we can tell exactly what is in use where. This will help us figure out ENOSPC problems, and help users better understand where their disk space is going. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2ac55d41 |
|
03-Feb-2010 |
Josef Bacik <josef@redhat.com> |
Btrfs: cache the extent state everywhere we possibly can V2 This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached() and pass around a extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boots on my random write test. I have not converted some things, like the relocation and ioctl's, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed the clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like this lock_extent_bits() clear delalloc bits unlock_extent_cached() without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1e701a32 |
|
11-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add new defrag-range ioctl. The btrfs defrag ioctl was limited to doing the entire file. This commit adds a new interface that can defrag a specific range inside the file. It can also force compression on the file, allowing you to selectively compress individual files after they were created, even when mount -o compress isn't turned on. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
940100a4 |
|
10-Mar-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: be more selective in the defrag ioctl The btrfs defrag ioctl had some bugs around delalloc accounting, and it wasn't properly skipping pages that were not in the mapping. It wasn't properly clearing the page checked flag, which could make the writeback code ignore the page forever while pinning it as dirty. This commit fixes those problems and makes defrag a little smarter. It skips holes and it doesn't waste time defragging large extents. If a tiny extent comes before a very large extent, it will defrag both of them to make sure the tiny extent ends up next to something big. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6ef5ed0d |
|
11-Dec-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: add ioctl and incompat flag to set the default mount subvol This patch needs to go along with my previous patch. This lets us set the default dir item's location to whatever root we want to use as our default mounting subvol. With this we don't have to use mount -o subvol=<tree id> anymore to mount a different subvol, we can just set the new one and it will just magically work. I've done some moderate testing with this, mostly just switching the default mount around, mounting subvols and the default mount at the same time and such, everything seems to work. Thanks, Older kernels would generally be able to still mount the filesystem with the default subvolume set, but it would result in a different volume being mounted, which could be an even more unpleasant suprise for users. So if you set your default subvolume, you can't go back to older kernels. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ac8e9819 |
|
28-Feb-2010 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add search and inode lookup ioctls The search ioctl is a generic tool for doing btree searches from userland applications. The first user of the search ioctl is a subvolume listing feature, but we'll also use it to find new files in a subvolume. The search ioctl allows you to specify min and max keys to search for, along with min and max transid. It returns the items along with a header that includes the item key. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
98d377a0 |
|
17-Nov-2009 |
TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> |
Btrfs: add a function to lookup a directory path by following backrefs This will be used by the inode lookup ioctl. Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
86b9f2ec |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Fix per root used space accounting The bytes_used field in root item was originally planned to trace the amount of used data and tree blocks. But it never worked right since we can't trace freeing of data accurately. This patch changes it to only trace the amount of tree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2e4bfab9 |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Avoid orphan inodes cleanup during committing transaction btrfs_lookup_dentry may trigger orphan cleanup, so it's not good to call it while committing a transaction. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
920bbbfb |
|
12-Nov-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: Rewrite btrfs_drop_extents Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can avoid calling lock_extent within transaction. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ac6889cb |
|
09-Oct-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix file clone ioctl for bookend extents The file clone ioctl was incorrectly taking the offset into the extent on disk into account when calculating the length of the cloned extent. The length never changes based on the offset into the physical extent. Test case: fallocate -l 1g image mke2fs image bcp image image2 e2fsck -f image2 (errors on image2) The math bug ends up wrapping the length of the extent, and things go wrong from there. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
efefb143 |
|
09-Oct-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: remove negative dentry when deleting subvolumne The use of btrfs_dentry_delete is removing dentries from the dcache when deleting subvolumne. btrfs_dentry_delete ignores negative dentries. This is incorrect since if we don't remove the negative dentry, its parent dentry can't be removed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1ab86aed |
|
29-Sep-2009 |
Sage Weil <sage@newdream.net> |
Btrfs: fix error cases for ioctl transactions Fix leak of vfsmount write reference and open_ioctl_trans reference on ENOMEM. Clean up the error paths while we're at it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ed74f2d |
|
11-Sep-2009 |
Josef Bacik <josef@redhat.com> |
Btrfs: proper -ENOSPC handling At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
1fb58a60 |
|
21-Sep-2009 |
Sage Weil <sage@newdream.net> |
Btrfs: fix arithmetic error in clone ioctl Fix an arithmetic error that was breaking extents cloned via the clone ioctl starting in the second half of a file. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
76dda93c |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: add snapshot/subvolume destroy ioctl This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
4df27c4d |
|
21-Sep-2009 |
Yan, Zheng <zheng.yan@oracle.com> |
Btrfs: change how subvolumes are organized btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a1ed835e |
|
10-Sep-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix extent replacment race Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
405f5571 |
|
11-Jul-2009 |
Alexey Dobriyan <adobriyan@gmail.com> |
headers: smp_lock.h redux * Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
c8a894d7 |
|
27-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix the file clone ioctl for preallocated extents
|
#
0b4dcea5 |
|
11-Jun-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir This happens during subvol creation. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
6cbff00f |
|
17-Apr-2009 |
Christoph Hellwig <hch@lst.de> |
Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION Add support for the standard attributes set via chattr and read via lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were confusing the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d4f98a2 |
|
10-Jun-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5d847a8e |
|
14-May-2009 |
Li Hong <lihong.hi@gmail.com> |
Btrfs: remove outdated comment in btrfs_ioctl_resize() In Li Zefan's commit dae7b665cf6d6e6e733f1c9c16cf55547dd37e33, a combination call of kmalloc() and copy_from_user() is replaced by memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more. Signed-off-by: Li Hong <lihong.hi@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
21380931 |
|
21-Apr-2009 |
Joel Becker <joel.becker@oracle.com> |
Btrfs: Fix a bunch of printk() warnings. Just happened to notice a bunch of %llu vs u64 warnings. Here's a patch to cast them all. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
e980b50c |
|
24-Apr-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: fix fallocate deadlock on inode extent lock The btrfs fallocate call takes an extent lock on the entire range being fallocated, and then runs through insert_reserved_extent on each extent as they are allocated. The problem with this is that btrfs_drop_extents may decide to try and take the same extent lock fallocate was already holding. The solution used here is to push down knowledge of the range that is already locked going into btrfs_drop_extents. It turns out that at least one other caller had the same bug. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
dae7b665 |
|
08-Apr-2009 |
Li Zefan <lizf@cn.fujitsu.com> |
btrfs: use memdup_user() Remove open-coded memdup_user(). Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may cause pagefault, it's pointless to pass GFP_NOFS to kmalloc(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ce3b0f8d |
|
29-Mar-2009 |
Al Viro <viro@zeniv.linux.org.uk> |
New helper - current_umask() current->fs->umask is what most of fs_struct users are doing. Put that into a helper function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6a63209f |
|
20-Feb-2009 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add better -ENOSPC handling This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
7eaebe7d |
|
21-Jan-2009 |
Huang Weiyi <weiyi.huang@gmail.com> |
Btrfs: removed unused #include <version.h>'s Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d397712b |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
52c26179 |
|
05-Jan-2009 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: update directory's size when creating subvol/snapshot Make sure directory's size properly updated when creating subvol/snapshot. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
e441d54d |
|
05-Jan-2009 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: add permission checks to the ioctls Only root can add/remove devices Only root can defrag subtrees Only files open for writing can be defragged Only files open for writing can be the destination for a clone Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ab67b7c1 |
|
19-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add missing mnt_drop_write in ioctl.c This patch adds the missing mnt_drop_write to match mnt_want_write in btrfs_ioctl_defrag and btrfs_ioctl_clone Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
d2fb3437 |
|
11-Dec-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: fix leaking block group on balance The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
cfc8ea87 |
|
11-Dec-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: mnt_drop_write in ioctl_trans_end Add missing mnt_drop_write to match the mnt_want_write in btrfs_ioctl_trans_start. Signed-off-by: Sage Weil <sage@newdream.net>
|
#
d20f7043 |
|
08-Dec-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: move data checksumming into a dedicated tree Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
607d432d |
|
02-Dec-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: add support for multiple csum algorithms This patch gives us the space we will need in order to have different csum algorithims at some point in the future. We save the csum algorithim type in the superblock, and use those instead of define's. Signed-off-by: Josef Bacik <jbacik@redhat.com>
|
#
7a865e8a |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: btrfs: pass void __user * to btrfs_ioctl_clone_range Cleans the code up a little and also avoids a sparse warning due to the incorrect cast in the current version of the code. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
4bcabaa3 |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: clean up btrfs_ioctl a little bit Provide a void __user *argp pointer so that we can avoid duplicating the cast for various sub-command calls. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
b2950863 |
|
02-Dec-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: make things static and include the right headers Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
1ffa4f42 |
|
02-Dec-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: remove unneeded btrfs_start_delalloc_inodes call It is called by btrfs_sync_fs. Signed-off-by: Sage Weil <sage@newdream.net>
|
#
4b4e25f2 |
|
20-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: compat code fixes The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ea9e8b11 |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: prevent loops in the directory tree when creating snapshots For a directory tree: /mnt/subvolA/subvolB btrfsctl -s /mnt/subvolA/subvolB /mnt Will create a directory loop with subvolA under subvolB. This commit uses the forward refs for each subvol and snapshot to error out before creating the loop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
0660b5af |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add backrefs and forward refs for subvols and snapshots Subvols and snapshots can now be referenced from any point in the directory tree. We need to maintain back refs for them so we can find lost subvols. Forward refs are added so that we know all of the subvols and snapshots referenced anywhere in the directory tree of a single subvol. This can be used to do recursive snapshotting (but they aren't yet) and it is also used to detect and prevent directory loops when creating new snapshots. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3394e160 |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Give each subvol and snapshot their own anonymous devid Each subvolume has its own private inode number space, and so we need to fill in different device numbers for each subvolume to avoid confusing applications. This commit puts a struct super_block into struct btrfs_root so it can call set_anon_super() and get a different device number generated for each root. btrfs_rename is changed to prevent renames across subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3de4586c |
|
17-Nov-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Allow subvolumes and snapshots anywhere in the directory tree Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
2b82032c |
|
17-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Seed device support Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c146afad |
|
12-Nov-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: mount ro and remount support This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
c5c9cd4d |
|
12-Nov-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: allow clone of an arbitrary file range This patch adds an additional CLONE_RANGE ioctl to clone an arbitrary (block-aligned) file range to another file. The original CLONE ioctl becomes a special case of cloning the entire file range. The logic is a bit more complex now since ranges may be cloned to different offsets, and because we may only be cloning the beginning or end of a particular extent or checksum item. An additional sanity check ensures the source and destination files aren't the same (which would previously deadlock), although eventually this could be extended to allow the duplication of file data at a different offset within the same file. Any extents within the destination range in the target file are dropped. We currently do not cope with the case where a compressed inline extent needs to be split. This will probably require decompressing the extent into a temporary address_space, and inserting just the cloned portion as a new compressed inline extent. For now, just return -EINVAL in this case. Note that this never comes up in the more common case of cloning an entire file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
d899e052 |
|
30-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add fallocate support v2 This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
80ff3856 |
|
30-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: update nodatacow code v2 This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
84234f3a |
|
29-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Add root tree pointer transaction ids This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
a3dddf3f |
|
10-Oct-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Don't call security_inode_mkdir during subvol creation Subvol creation already requires privs, and security_inode_mkdir isn't exported. For now we don't need it. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
cb8e7090 |
|
09-Oct-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: Fix subvolume creation locking rules Creating a subvolume is in many ways like a normal VFS ->mkdir, and we really need to play with the VFS topology locking rules. So instead of just creating the snapshot on disk and then later getting rid of confliting aliases do it correctly from the start. This will become especially important once we allow for subvolumes anywhere in the tree, and not just below a hidden root. Note that snapshots will need the same treatment, but do to the delay in creating them we can't do it currently. Chris promised to fix that issue, so I'll wait on that. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
3bb1a1bc |
|
09-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Remove offset field from struct btrfs_extent_ref The offset field in struct btrfs_extent_ref records the position inside file that file extent is referenced by. In the new back reference system, tree leaves holding references to file extent are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are in deleting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
a76a3cd4 |
|
09-Oct-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Count space allocated to file in bytes This patch makes btrfs count space allocated to file in bytes instead of 512 byte sectors. Everything else in btrfs uses a byte count instead of sector sizes or blocks sizes, so this fits better. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
|
#
5b21f2ed |
|
26-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: extent_map and data=ordered fixes for space balancing * Add an EXTENT_BOUNDARY state bit to keep the writepage code from merging data extents that are in the process of being relocated. This allows us to do accounting for them properly. * The balancing code relocates data extents indepdent of the underlying inode. The extent_map code was modified to properly account for things moving around (invalidating extent_map caches in the inode). * Don't take the drop_mutex in the create_subvol ioctl. It isn't required. * Fix walking of the ordered extent list to avoid races with sys_unlink * Change the lock ordering rules. Transaction start goes outside the drop_mutex. This allows btrfs_commit_transaction to directly drop the relocation trees. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
31840ae1 |
|
23-Sep-2008 |
Zheng Yan <zheng.yan@oracle.com> |
Btrfs: Full back reference support This patch makes the back reference system to explicit record the location of parent node for all types of extents. The location of parent node is placed into the offset field of backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
b214107e |
|
05-Sep-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: trivial sparse fixes Fix a bunch of trivial sparse complaints. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7ea394f1 |
|
05-Aug-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Fix nodatacow for the new data=ordered mode Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ae01a0ab |
|
04-Aug-2008 |
Yan Zheng <zheng.yan@oracle.com> |
Btrfs: Update clone file ioctl This patch updates the file clone ioctl for the tree locking and new data ordered code. --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
ea8c2819 |
|
04-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
9ca9ee09 |
|
04-Aug-2008 |
Sage Weil <sage@newdream.net> |
Btrfs: fix ioctl-initiated transactions vs wait_current_trans() Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in progress) breaks the transaction start/stop ioctls by making btrfs_start_transaction conditionally wait for the next transaction to start. If an application artificially is holding a transaction open, things deadlock. This workaround maintains a count of open ioctl-initiated transactions in fs_info, and avoids wait_current_trans() if any are currently open (in start_transaction() and btrfs_throttle()). The start transaction ioctl uses a new btrfs_start_ioctl_transaction() that _does_ call wait_current_trans(), effectively pushing the join/wait decision to the outer ioctl-initiated transaction. This more or less neuters btrfs_throttle() when ioctl-initiated transactions are in use, but that seems like a pretty fundamental consequence of wrapping lots of write()'s in a transaction. Btrfs has no way to tell if the application considers a given operation as part of it's transaction. Obviously, if the transaction start/stop ioctls aren't being used, there is no effect on current behavior. Signed-off-by: Sage Weil <sage@newdream.net> --- ctree.h | 1 + ioctl.c | 12 +++++++++++- transaction.c | 18 +++++++++++++----- transaction.h | 2 ++ 4 files changed, 27 insertions(+), 6 deletions(-) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f87f057b |
|
01-Aug-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Improve and cleanup locking done by walk_down_tree While dropping snapshots, walk_down_tree does most of the work of checking reference counts and limiting tree traversal to just the blocks that we are freeing. It dropped and held the allocation mutex in strange and confusing ways, this commit changes it to only hold the mutex while actually freeing a block. The rest of the checks around reference counts should be safe without the lock because we only allow one process in btrfs_drop_snapshot at a time. Other processes dropping reference counts should not drop it to 1 because their tree roots already have an extra ref on the block. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
5516e595 |
|
23-Jul-2008 |
Mark Fasheh <mfasheh@suse.com> |
Btrfs: Null terminate strings passed in from userspace The 'char name[BTRFS_PATH_NAME_MAX]' member of struct btrfs_ioctl_vol_args is passed directly to strlen() after being copied from user. I haven't verified this, but in theory a userspace program could pass in an unterminated string and cause a kernel crash as strlen walks off the end of the array. This patch terminates the ->name string in all btrfs ioctl functions which currently use a 'struct btrfs_ioctl_vol_args'. Since the string is now properly terminated, it's length will never be longer than BTRFS_PATH_NAME_MAX so that error check has been removed. By the way, it might be better overall to just have the ioctl pass an unterminated string + length structure but I didn't bother with that since it'd change the kernel/user interface. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
8e8a1e31 |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Fix a few functions that exit without stopping their transaction Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
aec7477b |
|
23-Jul-2008 |
Josef Bacik <jbacik@redhat.com> |
Btrfs: Implement new dir index format Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
3eaa2885 |
|
24-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Fix the defragmention code and the block relocation code for data=ordered Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
7d9eb12c |
|
08-Jul-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Add locking around volume management (device add/remove/balance) Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
89ce8a63 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Add btrfs_end_transaction_throttle to force writers to wait for pending commits The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
a2135011 |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Replace the big fs_mutex with a collection of other locks Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
925baedd |
|
25-Jun-2008 |
Chris Mason <chris.mason@oracle.com> |
Btrfs: Start btree concurrency work. The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
df5b5520 |
|
11-Jun-2008 |
Christoph Hellwig <hch@lst.de> |
BTRFS_IOC_TRANS_START should be privilegued As mentioned in the comment next to it btrfs_ioctl_trans_start can do bad damage to filesystems and thus should be limited to privilegued users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|
#
f46b5a66 |
|
11-Jun-2008 |
Christoph Hellwig <hch@lst.de> |
Btrfs: split out ioctl.c Split the ioctl handling out of inode.c into a file of it's own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
|