History log of /linux-master/fs/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
b7213ffa 15-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

qnx4: avoid stringop-overread errors

The qnx4 directory entries are 64-byte blocks that have different
contents depending on the a status byte that is in the last byte of the
block.

In particular, a directory entry can be either a "link info" entry with
a 48-byte name and pointers to the real inode information, or an "inode
entry" with a smaller 16-byte name and the full inode information.

But the code was written to always just treat the directory name as if
it was part of that "inode entry", and just extend the name to the
longer case if the status byte said it was a link entry.

That work just fine and gives the right results, but now that gcc is
tracking data structure accesses much more, the code can trigger a
compiler error about using up to 48 bytes (the long name) in a structure
that only has that shorter name in it:

fs/qnx4/dir.c: In function ‘qnx4_readdir’:
fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
51 | size = strnlen(de->di_fname, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs/qnx4/qnx4.h:3,
from fs/qnx4/dir.c:16:
include/uapi/linux/qnx4_fs.h:45:25: note: source object declared here
45 | char di_fname[QNX4_SHORT_NAME_MAX];
| ^~~~~~~~

which is because the source code doesn't really make this whole "one of
two different types" explicit.

Fix this by introducing a very explicit union of the two types, and
basically explaining to the compiler what is really going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fdfc3463 12-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull namei updates from Al Viro:
"Clearing fallout from mkdirat in io_uring series. The fix in the
kern_path_locked() patch plus associated cleanups"

* 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
putname(): IS_ERR_OR_NULL() is wrong here
namei: Standardize callers of filename_create()
namei: Standardize callers of filename_lookup()
rename __filename_parentat() to filename_parentat()
namei: Fix use after free in kern_path_locked


8d4a0b5d 12-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smbfs updates from Steve French:
"cifs/smb3 updates:

- DFS reconnect fix

- begin creating common headers for server and client

- rename the cifs_common directory to smbfs_common to be more
consistent ie change use of the name cifs to smb (smb3 or smbfs is
more accurate, as the very old cifs dialect has long been
superseded by smb3 dialects).

In the future we can rename the fs/cifs directory to fs/smbfs.

This does not include the set of multichannel fixes nor the two
deferred close fixes (they are still being reviewed and tested)"

* tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: properly invalidate cached root handle when closing it
cifs: move SMB FSCTL definitions to common code
cifs: rename cifs_common to smbfs_common
cifs: update FSCTL definitions


78e70952 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- vduse driver ("vDPA Device in Userspace") supporting emulated virtio
block devices

- virtio-vsock support for end of record with SEQPACKET

- vdpa: mac and mq support for ifcvf and mlx5

- vdpa: management netlink for ifcvf

- virtio-i2c, gpio dt bindings

- misc fixes and cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
Documentation: Add documentation for VDUSE
vduse: Introduce VDUSE - vDPA Device in Userspace
vduse: Implement an MMU-based software IOTLB
vdpa: Support transferring virtual addressing during DMA mapping
vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
vhost-iotlb: Add an opaque pointer for vhost IOTLB
vhost-vdpa: Handle the failure of vdpa_reset()
vdpa: Add reset callback in vdpa_config_ops
vdpa: Fix some coding style issues
file: Export receive_fd() to modules
eventfd: Export eventfd_wake_count to modules
iova: Export alloc_iova_fast() and free_iova_fast()
virtio-blk: remove unneeded "likely" statements
virtio-balloon: Use virtio_find_vqs() helper
vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
vsock_test: update message bounds test for MSG_EOR
af_vsock: rename variables in receive loop
virtio/vsock: support MSG_EOR bit processing
vhost/vsock: support MSG_EOR bit processing
...


c605c396 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right
now as we have plenty of flags left, but could become one. (Hao)

- Fix lockdep issue introduced in this merge window (me)

- Fix a few issues with the worker creation (me, Pavel, Qiang)

- Fix regression with wq_has_sleeper() for IOPOLL (Pavel)

- Timeout link error propagation fix (Pavel)

* tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
io_uring: fail links of cancelled timeouts
io-wq: fix memory leak in create_io_worker()
io-wq: fix silly logic error in io_task_work_match()
io_uring: drop ctx->uring_lock before acquiring sqd->lock
io_uring: fix missing mb() before waitqueue_active
io-wq: fix cancellation on create-worker failure


c0f7e49f 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph:
- fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
- update a MAINTAINERS email address (Chaitanya Kulkarni)
- set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
- handle errors from add_disk() (Luis Chamberlain)
- update the keep alive interval when kato is modified (Tatsuya Sasaki)
- fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
- do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
- only call synchronize_srcu when clearing current path (Daniel Wagner)
- revalidate paths during rescan (Hannes Reinecke)

- Split out the fs/block_dev into block/fops.c and block/bdev.c, which
has been long overdue. Do this now before -rc1, to avoid annoying
conflicts due to this (Christoph)

- blk-throtl use-after-free fix (Li)

- Improve plug depth for multi-device plugs, greatly increasing md
resync performance (Song)

- blkdev_show() locking fix (Tetsuo)

- n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
n64cart: fix return value check in n64cart_probe()
blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
block: move fs/block_dev.c to block/bdev.c
block: split out operations on block special files
blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
block: genhd: don't call blkdev_show() with major_names_lock held
nvme: update MAINTAINERS email address
nvme: add error handling support for add_disk()
nvme: only call synchronize_srcu when clearing current path
nvme: update keep alive interval when kato is modified
nvme-tcp: Do not reset transport on data digest errors
nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
nvmet: looks at the passthrough controller when initializing CAP
nvme: move nvme_multi_css into nvme.h
nvme-multipath: revalidate paths during rescan
nvme-multipath: set QUEUE_FLAG_NOWAIT


4396a731 09-Sep-2021 Amir Goldstein <amir73il@gmail.com>

fsnotify: fix sb_connectors leak

Fix a leak in s_fsnotify_connectors counter in case of a race between
concurrent add of new fsnotify mark to an object.

The task that lost the race fails to drop the counter before freeing
the unused connector.

Following umount() hangs in fsnotify_sb_delete()/wait_var_event(),
because s_fsnotify_connectors never drops to zero.

Fixes: ec44610fe2b8 ("fsnotify: count all objects with attached connectors")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Link: https://lore.kernel.org/linux-fsdevel/20210907063338.ycaw6wvhzrfsfdlp@xzhoux.usersys.redhat.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32c2d33e 06-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT

Build check of __REQ_F_LAST_BIT should be larger than, not equal or larger
than. It's perfectly valid to have __REQ_F_LAST_BIT be 32, as that means
that the last valid bit is 31 which does fit in the type.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210907032243.114190-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bf9f243f 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-ksmbd-part2' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:

- various fixes pointed out by coverity, and a minor cleanup patch

- id mapping and ownership fixes

- an smbdirect fix

* tag '5.15-rc-ksmbd-part2' of git://git.samba.org/ksmbd:
ksmbd: fix control flow issues in sid_to_id()
ksmbd: fix read of uninitialized variable ret in set_file_basic_info
ksmbd: add missing assignments to ret on ndr_read_int64 read calls
ksmbd: add validation for ndr read/write functions
ksmbd: remove unused ksmbd_file_table_flush function
ksmbd: smbd: fix dma mapping error in smb_direct_post_send_data
ksmbd: Reduce error log 'speed is unknown' to debug
ksmbd: defer notify_change() call
ksmbd: remove setattr preparations in set_file_basic_info()
ksmbd: ensure error is surfaced in set_file_basic_info()
ndr: fix translation in ndr_encode_posix_acl()
ksmbd: fix translation in sid_to_id()
ksmbd: fix subauth 0 handling in sid_to_id()
ksmbd: fix translation in acl entries
ksmbd: fix translation in ksmbd_acls_fattr()
ksmbd: fix translation in create_posix_rsp_buf()
ksmbd: fix translation in smb2_populate_readdir_entry()
ksmbd: fix lookup on idmapped mounts


8dde2086 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix max_inline mount option limit on 64k page system

- lockdep fixes:
- update bdev time in a safer way
- move bdev put outside of sb write section when removing device
- fix possible deadlock when mounting seed/sprout filesystem

- zoned mode: fix split extent accounting

- minor include fixup

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix double counting of split ordered extent
btrfs: fix lockdep warning while mounting sprout fs
btrfs: delay blkdev_put until after the device remove
btrfs: update the bdev time directly when closing
btrfs: use correct header for div_u64 in misc.h
btrfs: fix upper limit for max_inline for page size 64K


9351590f 09-Sep-2021 Enzo Matsumiya <ematsumiya@suse.de>

cifs: properly invalidate cached root handle when closing it

Cached root file was not being completely invalidated sometimes.

Reproducing:
- With a DFS share with 2 targets, one disabled and one enabled
- start some I/O on the mount
# while true; do ls /mnt/dfs; done
- at the same time, disable the enabled target and enable the disabled
one
- wait for DFS cache to expire
- on reconnect, the previous cached root handle should be invalid, but
open_cached_dir_by_dentry() will still try to use it, but throws a
use-after-free warning (kref_get())

Make smb2_close_cached_fid() invalidate all fields every time, but only
send an SMB2_close() when the entry is still valid.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>

d6c338a7 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

- Support for VMAP_STACK

- Support for splice_write in hostfs

- Fixes for virt-pci

- Fixes for virtio_uml

- Various fixes

* tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: fix stub location calculation
um: virt-pci: fix uapi documentation
um: enable VMAP_STACK
um: virt-pci: don't do DMA from stack
hostfs: support splice_write
um: virtio_uml: fix memory leak on init failures
um: virtio_uml: include linux/virtio-uml.h
lib/logic_iomem: fix sparse warnings
um: make PCI emulation driver init/exit static


35776f10 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM development updates from Russell King:

- Rename "mod_init" and "mod_exit" so that initcall debug output is
actually useful (Randy Dunlap)

- Update maintainers entries for linux-arm-kernel to indicate it is
moderated for non-subscribers (Randy Dunlap)

- Move install rules to arch/arm/Makefile (Masahiro Yamada)

- Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

- Don't warn about atags_to_fdt() stack size (David Heidelberg)

- Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

- Get rid of set_fs() usage (Arnd Bergmann)

- Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
ARM: 9116/1: unified: Remove check for gcc < 4
ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
ARM: 9113/1: uaccess: remove set_fs() implementation
ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
ARM: 9111/1: oabi-compat: rework fcntl64() emulation
ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
ARM: 9107/1: syscall: always store thread_info->abi_syscall
ARM: 9109/1: oabi-compat: add epoll_pwait handler
ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
ARM: 9105/1: atags_to_fdt: don't warn about stack size
ARM: 9103/1: Drop ARCH_NR_GPIOS definition
ARM: 9102/1: move theinstall rules to arch/arm/Makefile
ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific


f154c806 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:
"Except for the xpram device driver removal it is all about fixes and
cleanups.

- Fix topology update on cpu hotplug, so notifiers see expected
masks. This bug was uncovered with SCHED_CORE support.

- Fix stack unwinding so that the correct number of entries are
omitted like expected by common code. This fixes KCSAN selftests.

- Add kmemleak annotation to stack_alloc to avoid false positive
kmemleak warnings.

- Avoid layering violation in common I/O code and don't unregister
subchannel from child-drivers.

- Remove xpram device driver for which no real use case exists since
the kernel is 64 bit only. Also all hypervisors got required
support removed in the meantime, which means the xpram device
driver is dead code.

- Fix -ENODEV handling of clp_get_state in our PCI code.

- Enable KFENCE in debug defconfig.

- Cleanup hugetlbfs s390 specific Kconfig dependency.

- Quite a lot of trivial fixes to get rid of "W=1" warnings, and and
other simple cleanups"

* tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
hugetlbfs: s390 is always 64bit
s390/ftrace: remove incorrect __va usage
s390/zcrypt: remove incorrect kernel doc indicators
scsi: zfcp: fix kernel doc comments
s390/sclp: add __nonstring annotation
s390/hmcdrv_ftp: fix kernel doc comment
s390: remove xpram device driver
s390/pci: read clp_list_pci_req only once
s390/pci: fix clp_get_state() handling of -ENODEV
s390/cio: fix kernel doc comment
s390/ctrlchar: fix kernel doc comment
s390/con3270: use proper type for tasklet function
s390/cpum_cf: move array from header to C file
s390/mm: fix kernel doc comments
s390/topology: fix topology information when calling cpu hotplug notifiers
s390/unwind: use current_frame_address() to unwind current task
s390/configs: enable CONFIG_KFENCE in debug_defconfig
s390/entry: make oklabel within CHKSTG macro local
s390: add kmemleak annotation in stack_alloc()
s390/cio: dont unregister subchannel from child-drivers


7b871c77 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull gfs2 setattr updates from Al Viro:
"Make it possible for filesystems to use a generic 'may_setattr()' and
switch gfs2 to using it"

* 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gfs2: Switch to may_setattr in gfs2_setattr
fs: Move notify_change permission checks into may_setattr


e2e694b9 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull root filesystem type handling updates from Al Viro:
"Teach init/do_mounts.c to handle non-block filesystems, hopefully
preventing even more special-cased kludges (such as root=/dev/nfs,
etc)"

* 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: simplify get_filesystem_list / get_all_fs_names
init: allow mounting arbitrary non-blockdevice filesystems as root
init: split get_fs_names


7b7699c0 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter fixes from Al Viro:
"Fixes for io-uring handling of iov_iter reexpands"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
io_uring: reexpand under-reexpanded iters
iov_iter: track truncated size


2e5fd489 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

- Fix a race condition in the teardown path of raw mode pmem
namespaces.

- Cleanup the code that filesystems use to detect filesystem-dax
capabilities of their underlying block device.

* tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: remove bdev_dax_supported
xfs: factor out a xfs_buftarg_is_dax helper
dax: stub out dax_supported for !CONFIG_FS_DAX
dax: remove __generic_fsdax_supported
dax: move the dax_read_lock() locking into dax_supported
dax: mark dax_get_by_host static
dm: use fs_dax_get_by_bdev instead of dax_get_by_host
dax: stop using bdevname
fsdax: improve the FS_DAX Kconfig description and help text
libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind


2ae2eb9d 09-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fail links of cancelled timeouts

When we cancel a timeout we should mark it with REQ_F_FAIL, so
linked requests are cancelled as well, but not queued for further
execution.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

66e70be7 09-Sep-2021 Qiang.zhang <qiang.zhang@windriver.com>

io-wq: fix memory leak in create_io_worker()

BUG: memory leak
unreferenced object 0xffff888126fcd6c0 (size 192):
comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s)
backtrace:
[<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline]
[<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline]
[<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739
[<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline]
[<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866
[<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473
[<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933
[<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233
[<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462
[<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158
[<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164
[<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline]
[<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline]
[<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
[<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209
[<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
[<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302
[<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
[<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

when create_io_thread() return error, and not retry, the worker object
need to be freed.

Reported-by: syzbot+65454c239241d3d647da@syzkaller.appspotmail.com
Signed-off-by: Qiang.zhang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210909115822.181188-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d014f5f 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: move SMB FSCTL definitions to common code

The FSCTL definitions are in smbfsctl.h which should be
shared by client and server. Move the updated version of
smbfsctl.h into smbfs_common and have the client code use
it (subsequent patch will change the server to use this
common version of the header).

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

23e91d8b 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: rename cifs_common to smbfs_common

As we move to common code between client and server, we have
been asked to make the names less confusing, and refer less
to "cifs" and more to words which include "smb" instead to
e.g. "smbfs" for the client (we already have "ksmbd" for the
kernel server, and "smbd" for the user space Samba daemon).
So to be more consistent in the naming of common code between
client and server and reduce the risk of merge conflicts as
more common code is added - rename "cifs_common" to
"smbfs_common" (in future releases we also will rename
the fs/cifs directory to fs/smbfs)

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

fc111fb9 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: update FSCTL definitions

Add some missing defines used by ksmbd to the client
version of smbfsctl.h, and add a missing newer define
mentioned in the protocol definitions (MS-FSCC).

This will also make it easier to move to common code.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

3b33e3f4 08-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix silly logic error in io_task_work_match()

We check for the func with an OR condition, which means it always ends
up being false and we never match the task_work we want to cancel. In
the unexpected case that we do exit with that pending, we can trigger
a hang waiting for a worker to exit, but it was never created. syzbot
reports that as such:

INFO: task syz-executor687:8514 blocked for more than 143 seconds.
Not tainted 5.14.0-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor687 state:D stack:27296 pid: 8514 ppid: 8479 flags:0x00024004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0x940/0x26f0 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
io_wq_exit_workers fs/io-wq.c:1162 [inline]
io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197
io_uring_clean_tctx fs/io_uring.c:9607 [inline]
io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687
io_uring_files_cancel include/linux/io_uring.h:16 [inline]
do_exit+0x265/0x2a30 kernel/exit.c:780
do_group_exit+0x125/0x310 kernel/exit.c:922
get_signal+0x47f/0x2160 kernel/signal.c:2868
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x445cd9
RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c
RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154
R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000

While in there, also decrement accr->nr_workers. This isn't strictly
needed as we're exiting, but let's make sure the accounting matches up.

Fixes: 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
Reported-by: syzbot+f62d3e0a4ea4f38f5326@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

009ad9f0 08-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: drop ctx->uring_lock before acquiring sqd->lock

The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock
for all the registration opcodes. We also hold a ref to the ctx, and we
do drop the lock for other reasons to quiesce, so it's fine to drop the
ctx lock temporarily to grab the sqd->lock. This fixes the following
lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.5/25433 is trying to acquire lock:
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792

but task is already holding lock:
ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
__io_sq_thread fs/io_uring.c:7291 [inline]
io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 (&sqd->lock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3051 [inline]
check_prevs_add kernel/locking/lockdep.c:3174 [inline]
validate_chain kernel/locking/lockdep.c:3789 [inline]
__lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
__io_uring_register fs/io_uring.c:10757 [inline]
__do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&sqd->lock);
lock(&ctx->uring_lock);
lock(&sqd->lock);

*** DEADLOCK ***

Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3fc37253 08-Sep-2021 Dan Williams <dan.j.williams@intel.com>

Merge branch 'for-5.15/fsdax-cleanups' into for-5.15/libnvdimm

Include Christoph's rework of the dax_supported() helpers in the v5.15
libnvdimm update. This supports the ongoing dax-reflink enabling effort.


8a05abd0 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

- a set of patches to address fsync stalls caused by depending on
periodic rather than triggered MDS journal flushes in some cases
(Xiubo Li)

- a fix for mtime effectively not getting updated in case of competing
writers (Jeff Layton)

- a couple of fixes for inode reference leaks and various WARNs after
"umount -f" (Xiubo Li)

- a new ceph.auth_mds extended attribute (Jeff Layton)

- a smattering of fixups and cleanups from Jeff, Xiubo and Colin.

* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
ceph: fix dereference of null pointer cf
ceph: drop the mdsc_get_session/put_session dout messages
ceph: lockdep annotations for try_nonblocking_invalidate
ceph: don't WARN if we're forcibly removing the session caps
ceph: don't WARN if we're force umounting
ceph: remove the capsnaps when removing caps
ceph: request Fw caps before updating the mtime in ceph_write_iter
ceph: reconnect to the export targets on new mdsmaps
ceph: print more information when we can't find snaprealm
ceph: add ceph_change_snap_realm() helper
ceph: remove redundant initializations from mdsc and session
ceph: cancel delayed work instead of flushing on mdsc teardown
ceph: add a new vxattr to return auth mds for an inode
ceph: remove some defunct forward declarations
ceph: flush the mdlog before waiting on unsafe reqs
ceph: flush mdlog before umounting
ceph: make iterate_sessions a global symbol
ceph: make ceph_create_session_msg a global symbol
ceph: fix comment about short copies in ceph_write_end
ceph: fix memory leak on decode error in ceph_handle_caps


4cf0ccd0 06-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix control flow issues in sid_to_id()

Addresses-Coverity reported Control flow issues in sid_to_id()
/fs/ksmbd/smbacl.c: 277 in sid_to_id()
271
272 if (sidtype == SIDOWNER) {
273 kuid_t uid;
274 uid_t id;
275
276 id = le32_to_cpu(psid->sub_auth[psid->num_subauth - 1]);
>>> CID 1506810: Control flow issues (NO_EFFECT)
>>> This greater-than-or-equal-to-zero comparison of an unsigned value
>>> is always true. "id >= 0U".
277 if (id >= 0) {
278 /*
279 * Translate raw sid into kuid in the server's user
280 * namespace.
281 */
282 uid = make_kuid(&init_user_ns, id);

Addresses-Coverity: ("Control flow issues")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

4ffd5264 06-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix read of uninitialized variable ret in set_file_basic_info

Addresses-Coverity reported Uninitialized variables warninig :

/fs/ksmbd/smb2pdu.c: 5525 in set_file_basic_info()
5519 if (!rc) {
5520 inode->i_ctime = ctime;
5521 mark_inode_dirty(inode);
5522 }
5523 inode_unlock(inode);
5524 }
>>> CID 1506805: Uninitialized variables (UNINIT)
>>> Using uninitialized value "rc".
5525 return rc;
5526 }
5527
5528 static int set_file_allocation_info(struct ksmbd_work *work,
5529 struct ksmbd_file *fp, char *buf)
5530 {

Addresses-Coverity: ("Uninitialized variable")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

36bbeb33 06-Sep-2021 Colin Ian King <colin.king@canonical.com>

ksmbd: add missing assignments to ret on ndr_read_int64 read calls

Currently there are two ndr_read_int64 calls where ret is being checked
for failure but ret is not being assigned a return value from the call.
Static analyis is reporting the checks on ret as dead code. Fix this.

Addresses-Coverity: ("Logical dead code")
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

c57a91fb 08-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix missing mb() before waitqueue_active

In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a
memory barrier required by waitqueue_active(&ctx->poll_wait). There is
a wq_has_sleeper(), which does smb_mb() inside, but it's called only for
SQPOLL.

Fixes: 5fd4617840596 ("io_uring: be smarter about waking multiple CQ ring waiters")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d338201 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...


6fcac87e 07-Sep-2021 QiuXi <qiuxi1@huawei.com>

coredump: fix memleak in dump_vma_snapshot()

dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot()
returns -EFAULT, the memory will be leaked, so we free it correctly.

Link: https://lkml.kernel.org/r/20210810020441.62806-1-qiuxi1@huawei.com
Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Signed-off-by: QiuXi <qiuxi1@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbd9d6f8 07-Sep-2021 David Oberhollenzer <david.oberhollenzer@sigma-star.at>

fs/coredump.c: log if a core dump is aborted due to changed file permissions

For obvious security reasons, a core dump is aborted if the filesystem
cannot preserve ownership or permissions of the dump file.

This affects filesystems like e.g. vfat, but also something like a 9pfs
share in a Qemu test setup, running as a regular user, depending on the
security model used. In those cases, the result is an empty core file and
a confused user.

To hopefully save other people a lot of time figuring out the cause, this
patch adds a simple log message for those specific cases.

[akpm@linux-foundation.org: s/|%s/%s/ in printk text]

Link: https://lkml.kernel.org/r/20210701233151.102720-1-david.oberhollenzer@sigma-star.at
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

98e2e409 07-Sep-2021 Zhen Lei <thunder.leizhen@huawei.com>

nilfs2: use refcount_dec_and_lock() to fix potential UAF

When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().

Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.

Use refcount_dec_and_lock() to ensure that both the operations of
decrease refcount to 0 and link deletion are lock protected eliminates
this risk.

CPU0 CPU1
nilfs_put_root():
<-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);

kfree(root);
<-------- use-after-free

refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
kill_block_super+0x4a/0x90 fs/super.c:1446
deactivate_locked_super+0x6a/0xb0 fs/super.c:335
deactivate_super+0x85/0x90 fs/super.c:366
cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
__cleanup_mnt+0x15/0x20 fs/namespace.c:1125
task_work_run+0x8e/0x110 kernel/task_work.c:151
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There is no reproduction program, and the above is only theoretical
analysis.

Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

17243e1c 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group

kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del(). See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b2fe39c2 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group

If kobject_init_and_add returns with error, kobject_put() is needed here
to avoid memory leak, because kobject_init_and_add may return error
without freeing the memory associated with the kobject it allocated.

Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a3e18125 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group

The kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del. See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24f8cb1e 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group

If kobject_init_and_add return with error, kobject_put() is needed here to
avoid memory leak, because kobject_init_and_add may return error without
freeing the memory associated with the kobject it allocated.

Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbc6e7d4 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix NULL pointer in nilfs_##name##_attr_release

In nilfs_##name##_attr_release, kobj->parent should not be referenced
because it is a NULL pointer. The release() method of kobject is always
called in kobject_put(kobj), in the implementation of kobject_put(), the
kobj->parent will be assigned as NULL before call the release() method.
So just use kobj to get the subgroups, which is more efficient and can fix
a NULL pointer reference problem.

Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5f5dec07 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_device_group

Patch series "nilfs2: fix incorrect usage of kobject".

This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kboject in nilfs2
sysfs implementation.

This patch (of 6):

Reported by syzkaller:

BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
kstrdup+0x36/0x70 mm/util.c:60
kstrdup_const+0x35/0x60 mm/util.c:83
kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
kobject_add_varg lib/kobject.c:384 [inline]
kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
legacy_get_tree+0x105/0x210 fs/fs_context.c:592
vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0xf9b/0x1990 fs/namespace.c:3235
do_mount+0xea/0x100 fs/namespace.c:3248
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

If kobject_init_and_add return with error, then the cleanup of kobject
is needed because memory may be allocated in kobject_init_and_add
without freeing.

And the place of cleanup_dev_kobject should use kobject_put to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. And no more cleanup will
do after cleanup_dev_kobject, so kobject_put is needed here.

Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1e1c1583 07-Sep-2021 Nicholas Piggin <npiggin@gmail.com>

fs/epoll: use a per-cpu counter for user's watches count

This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net

Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c2f273eb 07-Sep-2021 Ohhoon Kwon <ohoono.kwon@samsung.com>

connector: send event on write to /proc/[pid]/comm

While comm change event via prctl has been reported to proc connector by
'commit f786ecba4158 ("connector: add comm change event report to proc
connector")', connector listeners were missing comm changes by explicit
writes on /proc/[pid]/comm.

Let explicit writes on /proc/[pid]/comm report to proc connector.

Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8d23b208 07-Sep-2021 Christoph Hellwig <hch@lst.de>

proc: stop using seq_get_buf in proc_task_name

Use seq_escape_str and seq_printf instead of poking holes into the
seq_file abstraction.

Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bb9c14ad 08-Sep-2021 David Hildenbrand <david@redhat.com>

hugetlbfs: s390 is always 64bit

No need to check for 64BIT. While at it, let's just select
ARCH_SUPPORTS_HUGETLBFS from arch/s390/Kconfig.

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210908154506.20764-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>

713b9825 08-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io-wq: fix cancellation on create-worker failure

WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1151 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1146 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151
io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794
Modules linked in:
Call Trace:
tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158
task_work_run+0xe0/0x1a0 kernel/task_work.c:164
tracehook_notify_signal include/linux/tracehook.h:212 [inline]
handle_signal_work kernel/entry/common.c:146 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae

When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just
call io_run_cancel() to clean up the request, it's already enqueued via
io_wqe_insert_work() and will be executed either by some other worker
during cancellation (e.g. in io_wq_put_and_exit()).

Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ea47ab11 07-Sep-2021 Al Viro <viro@zeniv.linux.org.uk>

putname(): IS_ERR_OR_NULL() is wrong here

Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b4a4f213 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Standardize callers of filename_create()

filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.

Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

794ebcea 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Standardize callers of filename_lookup()

filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.

[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example. It's more that we want non-consuming
variant for close relative of that function...]

Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c5f563f9 07-Sep-2021 Al Viro <viro@zeniv.linux.org.uk>

rename __filename_parentat() to filename_parentat()

... in separate commit, to avoid noise in previous one

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0766ec82 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Fix use after free in kern_path_locked

In 0ee50b47532a ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the filename before returning, and kern_path_locked() was migrated to
this calling convention. However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.

In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat(). Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().

Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper. Remove filename_parentat(),
now that we have no remaining callers.

Fixes: 0ee50b47532a ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

75b96f0e 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

- Allow mounting an active fuse device. Previously the fuse device
would always be mounted during initialization, and sharing a fuse
superblock was only possible through mount or namespace cloning

- Fix data flushing in syncfs (virtiofs only)

- Fix data flushing in copy_file_range()

- Fix a possible deadlock in atomic O_TRUNC

- Misc fixes and cleanups

* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: remove unused arg in fuse_write_file_get()
fuse: wait for writepages in syncfs
fuse: flush extending writes
fuse: truncate pagecache on atomic_o_trunc
fuse: allow sharing existing sb
fuse: move fget() to fuse_get_tree()
fuse: move option checking into fuse_fill_super()
fuse: name fs_context consistently
fuse: fix use after free in fuse_read_interrupt()


0bcfe68b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "memcg: enable accounting for pollfd and select bits arrays"

This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b.

Just like with the memcg lock accounting, the kernel test robot reports
a sizeable performance regression for this commit, and while it clearly
does the rigth thing in theory, we'll need to look at just how to avoid
or minimize the performance overhead of the memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

[ Note: the first link below is for this same commit but a different
commit ID, because it's the kernel test robot ended up noticing it in
Andrew Morton's patch queue ]

Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3754707b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "memcg: enable accounting for file lock caches"

This reverts commit 0f12156dff2862ac54235fc72703f18770769042.

The kernel test robot reports a sizeable performance regression for this
commit, and while it clearly does the rigth thing in theory, we'll need
to look at just how to avoid or minimize the performance overhead of the
memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cd1adf1b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"

This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0dca4462 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move fs/block_dev.c to block/bdev.c

Move it together with the rest of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd82cca7 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: split out operations on block special files

Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f79645df 06-Sep-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix double counting of split ordered extent

btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes.
Then, splitting an ordered extent will call btrfs_add_ordered_extent_*()
again for split extents, leading to double counting of the region of
a split extent. These leaked bytes are finally reported at unmount time
as follow:

BTRFS info (device dm-1): at unmount dio bytes count 364544

Fix the double counting by subtracting split extent's size from
fs_info->ordered_bytes.

Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c1247069 30-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: fix lockdep warning while mounting sprout fs

Following test case reproduces lockdep warning.

Test case:

$ mkfs.btrfs -f <dev1>
$ btrfstune -S 1 <dev1>
$ mount <dev1> <mnt>
$ btrfs device add <dev2> <mnt> -f
$ umount <mnt>
$ mount <dev2> <mnt>
$ umount <mnt>

The warning claims a possible ABBA deadlock between the threads
initiated by [#1] btrfs device add and [#0] the mount.

[ 540.743122] WARNING: possible circular locking dependency detected
[ 540.743129] 5.11.0-rc7+ #5 Not tainted
[ 540.743135] ------------------------------------------------------
[ 540.743142] mount/2515 is trying to acquire lock:
[ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.743458] but task is already holding lock:
[ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743541] which lock already depends on the new lock.
[ 540.743543] the existing dependency chain (in reverse order) is:

[ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
[ 540.743566] down_read_nested+0x48/0x2b0
[ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
[ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
[ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
[ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
[ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
[ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
[ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
[ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
[ 540.744166] __x64_sys_ioctl+0xe4/0x140
[ 540.744170] do_syscall_64+0x4b/0x80
[ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
[ 540.744184] __lock_acquire+0x155f/0x2360
[ 540.744188] lock_acquire+0x10b/0x5c0
[ 540.744190] __mutex_lock+0xb1/0xf80
[ 540.744193] mutex_lock_nested+0x27/0x30
[ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
[ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
[ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
[ 540.744472] legacy_get_tree+0x38/0x90
[ 540.744475] vfs_get_tree+0x30/0x140
[ 540.744478] fc_mount+0x16/0x60
[ 540.744482] vfs_kern_mount+0x91/0x100
[ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
[ 540.744536] legacy_get_tree+0x38/0x90
[ 540.744537] vfs_get_tree+0x30/0x140
[ 540.744539] path_mount+0x8d8/0x1070
[ 540.744541] do_mount+0x8d/0xc0
[ 540.744543] __x64_sys_mount+0x125/0x160
[ 540.744545] do_syscall_64+0x4b/0x80
[ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 540.744551] other info that might help us debug this:
[ 540.744552] Possible unsafe locking scenario:

[ 540.744553] CPU0 CPU1
[ 540.744554] ---- ----
[ 540.744555] lock(btrfs-chunk-00);
[ 540.744557] lock(&fs_devs->device_list_mutex);
[ 540.744560] lock(btrfs-chunk-00);
[ 540.744562] lock(&fs_devs->device_list_mutex);
[ 540.744564]
*** DEADLOCK ***

[ 540.744565] 3 locks held by mount/2515:
[ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
[ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
[ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.744708]
stack backtrace:
[ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5

But the device_list_mutex in clone_fs_devices() is redundant, as
explained below. Two threads [1] and [2] (below) could lead to
clone_fs_device().

[1]
open_ctree <== mount sprout fs
btrfs_read_chunk_tree()
mutex_lock(&uuid_mutex) <== global lock
read_one_dev()
open_seed_devices()
clone_fs_devices() <== seed fs_devices
mutex_lock(&orig->device_list_mutex) <== seed fs_devices

[2]
btrfs_init_new_device() <== sprouting
mutex_lock(&uuid_mutex); <== global lock
btrfs_prepare_sprout()
lockdep_assert_held(&uuid_mutex)
clone_fs_devices(seed_fs_device) <== seed fs_devices

Both of these threads hold uuid_mutex which is sufficient to protect
getting the seed device(s) freed while we are trying to clone it for
sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
can not free/write/replace because it is read-only. An unmounted seed
device can be freed by btrfs_free_stale_devices(), but it needs
uuid_mutex. So this patch removes the unnecessary device_list_mutex in
clone_fs_devices(). And adds a lockdep_assert_held(&uuid_mutex) in
clone_fs_devices().

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3fa421de 27-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: delay blkdev_put until after the device remove

When removing the device we call blkdev_put() on the device once we've
removed it, and because we have an EXCL open we need to take the
->open_mutex on the block device to clean it up. Unfortunately during
device remove we are holding the sb writers lock, which results in the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #407 Not tainted
------------------------------------------------------
losetup/11595 is trying to acquire lock:
ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_put+0x3a/0x220
btrfs_rm_device.cold+0x62/0xe5
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11595:
#0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc21255d4cb

So instead save the bdev and do the put once we've dropped the sb
writers lock in order to avoid the lockdep recursion.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8f96a5bf 27-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: update the bdev time directly when closing

We update the ctime/mtime of a block device when we remove it so that
blkid knows the device changed. However we do this by re-opening the
block device and calling filp_update_time. This is more correct because
it'll call the inode->i_op->update_time if it exists, but the block dev
inodes do not do this. Instead call generic_update_time() on the
bd_inode in order to avoid the blkdev_open path and get rid of the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #406 Not tainted
------------------------------------------------------
losetup/11596 is trying to acquire lock:
ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
file_open_name+0xc7/0x170
filp_open+0x2c/0x50
btrfs_scratch_superblocks.part.0+0x10f/0x170
btrfs_rm_device.cold+0xe8/0xed
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11596:
#0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cde7417c 30-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

btrfs: use correct header for div_u64 in misc.h

asm/do_div.h is for div_u64, but it is found in math64.h. This change
will make compiler job easier and prevent compiler errors in situation
where compiler will not find math64.h from another paths.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6f93e834 10-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: fix upper limit for max_inline for page size 64K

The mount option max_inline ranges from 0 to the sectorsize (which is
now equal to page size). But we parse the mount options too early and
before the actual sectorsize is read from the superblock. So the upper
limit of max_inline is unaware of the actual sectorsize and is limited
by the temporary sectorsize 4096, even on a system where the default
sectorsize is 64K.

Fix this by reading the superblock sectorsize before the mount option
parse.

Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

60f8fbaa 06-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.

In detail:

- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.

- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.

- Completion trace int -> unsigned -> long fix

- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL

- Fix regression with hash wait lock in this merge window

- Fix retry issued on block devices (Ming)

- Fix regression with links in this merge window (Pavel)

- Fix race with multi-shot poll and completions (Xiaoguang)

- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)

- Ensure submissions are flushed after running task_work (Pavel)"

* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker


a9667ac8 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: remove unused arg in fuse_write_file_get()

The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

660585b5 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: wait for writepages in syncfs

In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called. This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.

This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.

Introduce sync buckets, that hold a counter for the number of outstanding
write requests. On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.

It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets. Descendant buckets must
not complete until the parent completes. Add a count to the child (new)
bucket until the (parent) old bucket completes.

Use RCU protection to dereference the current bucket and to wake up an
emptied bucket. Use fc->lock to protect against parallel assignments to
the current bucket.

This leaves just the counter to be a possible scalability issue. The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

9c930054 31-Aug-2021 Xie Yongji <xieyongji@bytedance.com>

file: Export receive_fd() to modules

Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210831103634.33-4-xieyongji@bytedance.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

7a6b92d3 31-Aug-2021 Xie Yongji <xieyongji@bytedance.com>

eventfd: Export eventfd_wake_count to modules

Export eventfd_wake_count so that some modules can use
the eventfd_signal_count() to check whether the
eventfd_signal() call should be deferred to a safe context.

NB(mst): this patch is not needed in Linus tree since there
eventfd_signal_count() has been superseded by an already exported
eventfd_signal_allowed().

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210831103634.33-3-xieyongji@bytedance.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

0319b848 05-Sep-2021 Geert Uytterhoeven <geert@linux-m68k.org>

binfmt: a.out: Fix bogus semicolon

fs/binfmt_aout.c: In function ‘load_aout_library’:
fs/binfmt_aout.c:311:27: error: expected ‘)’ before ‘;’ token
311 | MAP_FIXED | MAP_PRIVATE;
| ^
fs/binfmt_aout.c:309:10: error: too few arguments to function ‘vm_mmap’
309 | error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
| ^~~~~~~
In file included from fs/binfmt_aout.c:12:
include/linux/mm.h:2626:35: note: declared here
2626 | extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
| ^~~~~~~

Fix this by reverting the accidental replacement of a comma by a
semicolon.

Fixes: 42be8b42535183f8 ("binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()")
Reported-by: noreply@ellerman.id.au
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

49624efa 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux

Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.

There are some (minor) user-visible changes:

- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().

- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).

- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"

* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()


f7464060 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://github.com/Paragon-Software-Group/linux-ntfs3

Merge NTFSv3 filesystem from Konstantin Komarov:
"This patch adds NTFS Read-Write driver to fs/ntfs3.

Having decades of expertise in commercial file systems development and
huge test coverage, we at Paragon Software GmbH want to make our
contribution to the Open Source Community by providing implementation
of NTFS Read-Write driver for the Linux Kernel.

This is fully functional NTFS Read-Write driver. Current version works
with NTFS (including v3.1) and normal/compressed/sparse files and
supports journal replaying.

We plan to support this version after the codebase once merged, and
add new features and fix bugs. For example, full journaling support
over JBD will be added in later updates"

Link: https://lore.kernel.org/lkml/20210729134943.778917-1-almaz.alexandrovich@paragon-software.com/
Link: https://lore.kernel.org/lkml/aa4aa155-b9b2-9099-b7a2-349d8d9d8fbd@paragon-software.com/

* git://github.com/Paragon-Software-Group/linux-ntfs3: (35 commits)
fs/ntfs3: Change how module init/info messages are displayed
fs/ntfs3: Remove GPL boilerplates from decompress lib files
fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Rework file operations
fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Fix error handling in indx_insert_into_root()
fs/ntfs3: Potential NULL dereference in hdr_find_split()
fs/ntfs3: Fix error code in indx_add_allocate()
fs/ntfs3: fix an error code in ntfs_get_acl_ex()
fs/ntfs3: add checks for allocation failure
fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
fs/ntfs3: Do not use driver own alloc wrappers
fs/ntfs3: Use kernel ALIGN macros over driver specific
fs/ntfs3: Restyle comment block in ni_parse_reparse()
fs/ntfs3: Remove unused including <linux/version.h>
fs/ntfs3: Fix fall-through warnings for Clang
fs/ntfs3: Fix one none utf8 char in source file
...


6abaa83c 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.

In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.

On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.

Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device

Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option

In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"

* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...


0961f0c0 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`

Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"

* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...


303fff2b 02-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add validation for ndr read/write functions

If ndr->length is smaller than expected size, ksmbd can access invalid
access in ndr->data. This patch add validation to check ndr->offset is
over ndr->length. and added exception handling to check return value of
ndr read/write function.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

687c59e7 31-Aug-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: remove unused ksmbd_file_table_flush function

ksmbd_file_table_flush is a leftover from SMB1. This function is no longer
needed as SMB1 has been removed from ksmbd.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

72d6cbb5 29-Aug-2021 Hyunchul Lee <hyc.lee@gmail.com>

ksmbd: smbd: fix dma mapping error in smb_direct_post_send_data

Becase smb direct header is mapped and msg->num_sge
already is incremented, the decrement should be
removed from the condition.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

d475866e 29-Aug-2021 Per Forlin <perfn@axis.com>

ksmbd: Reduce error log 'speed is unknown' to debug

This log happens on servers with a network bridge since
the bridge does not have a specified link speed.
This is not a real error so change the error log to debug instead.

Signed-off-by: Per Forlin <perfn@axis.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

28a5d3de 26-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: defer notify_change() call

When ownership is changed we might in certain scenarios loose the
ability to alter the inode after we changed ownership. This can e.g.
happen when we are on an idmapped mount where uid 0 is mapped to uid
1000 and uid 1000 is mapped to uid 0.
A caller with fs*id 1000 will be able to create files as *id 1000 on
disk. They will also be able to change ownership of files owned by *id 0
to *id 1000 but they won't be able to change ownership in the other
direction. This means acl operations following notify_change() would
fail. Move the notify_change() call after the acls have been updated.
This guarantees that we don't end up with spurious "hash value diff"
warnings later on because we managed to change ownership but didn't
manage to alter acls.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

db7fb6fe 25-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: remove setattr preparations in set_file_basic_info()

Permission checking and copying over ownership information is the task
of the underlying filesystem not ksmbd. The order is also wrong here.
This modifies the inode before notify_change(). If notify_change() fails
this will have changed ownership nonetheless. All of this is unnecessary
though since the underlying filesystem's ->setattr handler will do all
this (if required) by itself.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

eb5784f0 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: ensure error is surfaced in set_file_basic_info()

It seems the error was accidently ignored until now. Make sure it is
surfaced.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

9467a0ce 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ndr: fix translation in ndr_encode_posix_acl()

The sid_to_id() helper encodes raw ownership information suitable for
s*id handling. This is conceptually equivalent to reporting ownership
information via stat to userspace. In this case the consumer is ksmbd
instead of a regular user. So when encoding raw ownership information
suitable for s*id handling later we need to map the id up according to
the user namespace of ksmbd itself taking any idmapped mounts into
account.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

55cd04d7 24-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in sid_to_id()

The sid_to_id() functions is relevant when changing ownership of
filesystem objects based on acl information. In this case we need to
first translate the relevant s*ids into k*ids in ksmbd's user namespace
and account for any idmapped mounts. Requesting a change in ownership
requires the inverse translation to be applied when we would report
ownership to userspace. So k*id_from_mnt() must be used here.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

f0bb29d5 24-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix subauth 0 handling in sid_to_id()

It's not obvious why subauth 0 would be excluded from translation. This
would lead to wrong results whenever a non-identity idmapping is used.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

0e844efe 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in acl entries

The ksmbd server performs translation of posix acls to smb acls.
Currently the translation is wrong since the idmapping of the mount is
used to map the ids into raw userspace ids but what is relevant is the
user namespace of ksmbd itself. The user namespace of ksmbd itself which
is the initial user namespace. The operation is similar to asking "What
*ids would a userspace process see given that k*id in the relevant user
namespace?". Before the final translation we need to apply the idmapping
of the mount in case any is used. Add two simple helpers for ksmbd.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

43205ca7 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in ksmbd_acls_fattr()

When creating new filesystem objects ksmbd translates between k*ids and
s*ids. For this it often uses struct smb_fattr and stashes the k*ids in
cf_uid and cf_gid. Let cf_uid and cf_gid always contain the final
information taking any potential idmapped mounts into account. When
finally translation cf_*id into s*ids translate them into the user
namespace of ksmbd since that is the relevant user namespace here.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

3cdc20e7 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in create_posix_rsp_buf()

When transferring ownership information to the client the k*ids are
translated into raw *ids before they are sent over the wire. The
function currently erroneously translates the k*ids according to the
mount's idmapping. Instead, reporting the owning *ids to userspace the
underlying k*ids need to be mapped up in the caller's user namespace.
This is how stat() works.
The caller in this instance is ksmbd itself and ksmbd always runs in the
initial user namespace. Translate according to that taking any potential
idmapped mounts into account.

Switch to from_k*id_munged() which ensures that the overflow*id is
returned instead of the (*id_t)-1 when the k*id can't be translated.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

475d6f98 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in smb2_populate_readdir_entry()

When transferring ownership information to the
client the k*ids are translated into raw *ids before they are sent over
the wire. The function currently erroneously translates the k*ids
according to the mount's idmapping. Instead, reporting the owning *ids
to userspace the underlying k*ids need to be mapped up in the caller's
user namespace. This is how stat() works.
The caller in this instance is ksmbd itself and ksmbd always runs in the
initial user namespace. Translate according to that.

The idmapping of the mount is already taken into account by the lower
filesystem and so kstat->*id will contain the mapped k*ids.

Switch to from_k*id_munged() which ensures that the overflow*id is
returned instead of the (*id_t)-1 when the k*id can't be translated.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

da1e7ada 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix lookup on idmapped mounts

It's great that the new in-kernel ksmbd server will support idmapped
mounts out of the box! However, lookup is currently broken. Lookup
helpers such as lookup_one_len() call inode_permission() internally to
ensure that the caller is privileged over the inode of the base dentry
they are trying to lookup under. So the permission checking here is
currently wrong.

Linux v5.15 will gain a new lookup helper lookup_one() that does take
idmappings into account. I've added it as part of my patch series to
make btrfs support idmapped mounts. The new helper is in linux-next as
part of David's (Sterba) btrfs for-next branch as commit
c972214c133b ("namei: add mapping aware lookup helper").

I've said it before during one of my first reviews: I would very much
recommend adding fstests to [1]. It already seems to have very
rudimentary cifs support. There is a completely generic idmapped mount
testsuite that supports idmapped mounts.

[1]: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

89c2b3b7 23-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: reexpand under-reexpanded iters

[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae

old_size = iov_iter_count();
...
iov_iter_revert(old_size - iov_iter_count());

If iov_iter_revert() is done base on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates
borders causing problems. This trace is due to no one reexpanding after
generic_write_checks().

Now iters store how many bytes has been truncated, so reexpand them to
the initial state right before reverting.

Cc: stable@vger.kernel.org
Reported-by: Palash Oswal <oswalpalash@gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b250e6d1 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.

- Show a warning if a different compiler is used for building external
modules.

- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.

- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.

- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.

- Add Nick Desaulniers as a Kbuild reviewer.

- Drop stale cc-option tests.

- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.

- Show a warning if 'FORCE' is missing for if_changed rules.

- Various cleanups

* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...


14726903 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"173 patches.

Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...


22e5fe2a 02-Sep-2021 Nadav Amit <namit@vmware.com>

userfaultfd: prevent concurrent API initialization

userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.

However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.

Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.

To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.

Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a759a909 02-Sep-2021 Nadav Amit <namit@vmware.com>

userfaultfd: change mmap_changing to atomic

Patch series "userfaultfd: minor bug fixes".

Three unrelated bug fixes. The first two addresses possible issues (not
too theoretical ones), but I did not encounter them in practice.

The third patch addresses a test bug that causes the test to fail on my
system. It has been sent before as part of a bigger RFC.

This patch (of 3):

mmap_changing is currently a boolean variable, which is set and cleared
without any lock that protects against concurrent modifications.

mmap_changing is supposed to mark whether userfaultfd page-faults handling
should be retried since mappings are undergoing a change. However,
concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
mmap_changing to be false, although the remove event was still not read
(hence acknowledged) by the user.

Change mmap_changing to atomic_t and increase/decrease appropriately. Add
a debug assertion to see whether mmap_changing is negative.

Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5b78ed24 02-Sep-2021 Luigi Rizzo <lrizzo@google.com>

mm/pagemap: add mmap_assert_locked() annotations to find_vma*()

find_vma() and variants need protection when used. This patch adds
mmap_assert_lock() calls in the functions.

To make sure the invariant is satisfied, we also need to add a
mmap_read_lock() around the get_user_pages_remote() call in
get_arg_page(). The lock is not strictly necessary because the mm has
been newly created, but the extra cost is limited because the same mutex
was also acquired shortly before in __bprm_mm_init(), so it is hot and
uncontended.

[penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs]
Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp

Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f358afc5 02-Sep-2021 Christoph Hellwig <hch@lst.de>

mm: remove flush_kernel_dcache_page

flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.

The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:

1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible

Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

30acd0bd 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for new namesapces and struct nsproxy

Container admin can create new namespaces and force kernel to allocate up
to several pages of memory for the namespaces and its associated
structures.

Net and uts namespaces have enabled accounting for such allocations. It
makes sense to account for rest ones to restrict the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

839d6820 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for fasync_cache

fasync_struct is used by almost all character device drivers to set up the
fasync queue, and for regular files by the file lease code. This
structure is quite small but long-living and it can be assigned for any
open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0f12156d 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for file lock caches

User can create file locks for each open file and force kernel to allocate
small but long-living objects per each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/b009f4c7-f0ab-c0ec-8e83-918f47d677da@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b6558434 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for pollfd and select bits arrays

User can call select/poll system calls with a large number of assigned
file descriptors and force kernel to allocate up to several pages of
memory till end of these sleeping system calls. We have here long-living
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/56e31cb5-6e1e-bdba-d7ca-be64b9842363@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

79f6540b 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for mnt_cache entries

Patch series "memcg accounting from OpenVZ", v7.

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1. Now we're
rebasing again, revising our old patches and trying to push them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready
yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that they
are at least not worse as the according original RHEL7 kernel.

This patch (of 10):

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times. Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bb902cb4 02-Sep-2021 Yutian Yang <nglaive@gmail.com>

memcg: charge fs_context and legacy_fs_context

This patch adds accounting flags to fs_context and legacy_fs_context
allocation sites so that kernel could correctly charge these objects.

We have written a PoC to demonstrate the effect of the missing-charging
bugs. The PoC takes around 1,200MB unaccounted memory, while it is
charged for only 362MB memory usage. We evaluate the PoC on QEMU x86_64
v5.2.90 + Linux kernel v5.10.19 + Debian buster. All the limitations
including ulimits and sysctl variables are set as default. Specifically,
the hard NOFILE limit and nr_open in sysctl are both 1,048,576.

/*------------------------- POC code ----------------------------*/

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/file.h>
#include <time.h>
#include <sys/wait.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
#include <fcntl.h>
#include <linux/mount.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

#define STACK_SIZE (8 * 1024)
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
static inline int fsopen(const char *fs_name, unsigned int flags)
{
return syscall(__NR_fsopen, fs_name, flags);
}

static char thread_stack[512][STACK_SIZE];

int thread_fn(void* arg)
{
for (int i = 0; i< 800000; ++i) {
int fsfd = fsopen("nfs", FSOPEN_CLOEXEC);
if (fsfd == -1) {
errExit("fsopen");
}
}
while(1);
return 0;
}

int main(int argc, char *argv[]) {
int thread_pid;
for (int i = 0; i < 1; ++i) {
thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \
SIGCHLD, NULL);
}
while(1);
return 0;
}

/*-------------------------- end --------------------------------*/

Link: https://lkml.kernel.org/r/1626517201-24086-1-git-send-email-nglaive@gmail.com
Signed-off-by: Yutian Yang <nglaive@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <shenwenbo@zju.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

51cc3a66 02-Sep-2021 Hugh Dickins <hughd@google.com>

fs, mm: fix race in unlinking swapfile

We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off. That is not supposed to be possible.

2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.

may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.

Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.

Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
Fixes: f9454548e17c ("don't unlink an active swapfile")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9857a17f 02-Sep-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: remove try_get_page(), call try_get_compound_head() directly

try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7490a2d2 02-Sep-2021 Shakeel Butt <shakeelb@google.com>

writeback: memcg: simplify cgroup_writeback_by_id

Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
pages for a memcg. However mem_cgroup_wb_stats() does a lot more than
just get the number of dirty pages. Just directly get the number of dirty
pages instead of calling mem_cgroup_wb_stats(). Also
cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
remove the unused 'nr' parameter and no need to explicitly flush memcg
stats.

Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7ae12c80 02-Sep-2021 Johannes Weiner <hannes@cmpxchg.org>

fs: inode: count invalidated shadow pages in pginodesteal

pginodesteal is supposed to capture the impact that inode reclaim has on
the page cache state. Currently, it doesn't consider shadow pages that
get dropped this way, even though this can have a significant impact on
paging behavior, memory pressure calculations etc.

To improve visibility into these effects, make sure shadow pages get
counted when they get dropped through inode reclaim.

This changes the return value semantics of invalidate_mapping_pages()
semantics slightly, but the only two users are the inode shrinker itsel
and a usb driver that logs it for debugging purposes.

Link: https://lkml.kernel.org/r/20210614211904.14420-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

16e2df2a 02-Sep-2021 Johannes Weiner <hannes@cmpxchg.org>

fs: drop_caches: fix skipping over shadow cache inodes

When drop_caches truncates the page cache in an inode it also includes any
shadow entries for evicted pages. However, there is a preliminary check
on whether the inode has pages: if it has *only* shadow entries, it will
skip running truncation on the inode and leave it behind.

Fix the check to mapping_empty(), such that it runs truncation on any
inode that has cache entries at all.

Link: https://lkml.kernel.org/r/20210614211904.14420-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fee468fd 02-Sep-2021 Jan Kara <jack@suse.cz>

writeback: reliably update bandwidth estimation

Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback(). However neither of these
need to trigger when the system is relatively idle and writeback is
triggered e.g. from fsync(2). Make sure writeback estimates happen
reliably by triggering them from do_writepages().

Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

633a2abb 02-Sep-2021 Jan Kara <jack@suse.cz>

writeback: track number of inodes under writeback

Patch series "writeback: Fix bandwidth estimates", v4.

Fix estimate of writeback throughput when device is not fully busy doing
writeback. Michael Stapelberg has reported that such workload (e.g.
generated by linking) tends to push estimated throughput down to 0 and as
a result writeback on the device is practically stalled.

The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.

This patch (of 4):

Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput. In principle we could use number of pages under
writeback (WB_WRITEBACK counter) for this however normal percpu counter
reads are too inaccurate for our purposes and summing the counter is too
expensive.

Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9673e005 02-Sep-2021 Gang He <ghe@suse.com>

ocfs2: ocfs2_downconvert_lock failure results in deadlock

Usually, ocfs2_downconvert_lock() function always downconverts dlm lock to
the expected level for satisfy dlm bast requests from the other nodes.

But there is a rare situation. When dlm lock conversion is being
canceled, ocfs2_downconvert_lock() function will return -EBUSY. You need
to be aware that ocfs2_cancel_convert() function is asynchronous in fsdlm
implementation.

If we does not requeue this lockres entry, ocfs2 downconvert thread no
longer handles this dlm lock bast request. Then, the other nodes will not
get the dlm lock again, the current node's process will be blocked when
acquire this dlm lock again.

Link: https://lkml.kernel.org/r/20210830044621.12544-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6c85c2c7 02-Sep-2021 Tuo Li <islituo@gmail.com>

ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info()

A memory block is allocated through kmalloc(), and its return value is
assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not
initialized but it is accessed in:
iput(oinfo->dqi_gqinode);

To fix this possible uninitialized-variable access, assign NULL to
oinfo->dqi_gqinode, and add ocfs2_qinfo_lock_res_init() behind the
assignment in ocfs2_local_read_info(). Remove ocfs2_qinfo_lock_res_init()
in ocfs2_global_read_info().

Link: https://lkml.kernel.org/r/20210804031832.57154-1-islituo@gmail.com
Signed-off-by: Tuo Li <islituo@gmail.com>
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2f566394 02-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

ocfs2: remove an unnecessary condition

The case where "tmp_oh" is NULL is handled at the start of the function.
At this point we know it's non-NULL so this will always return 1.

Link: https://lkml.kernel.org/r/YOcItgIXtisi3MaO@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Larry Chen <lchen@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8d0920bd 21-Apr-2021 David Hildenbrand <david@redhat.com>

mm: remove VM_DENYWRITE

All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

4589ff7c 23-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: remove in-tree usage of MAP_DENYWRITE

At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.

In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.

While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
"Overlayfs did not honor positive i_writecount on realfile for
VM_DENYWRITE mappings." [1]

[1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

fe69d560 23-Apr-2021 David Hildenbrand <david@redhat.com>

kernel/fork: always deny write access to current MM exe_file

We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().

Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.

Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

42be8b42 21-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()

uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().

For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.

With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.

This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

31efe48e 03-Sep-2021 Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix possible poll event lost in multi shot mode

IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode,
that means once one pure poll request returns one event(cqe), we'll
need to read or write continually until EAGAIN is returned, then I think
there is a possible poll event lost race in multi shot mode:

t1 poll request add | |
t2 | |
t3 event happens | |
t4 task work add | |
t5 | task work run |
t6 | commit one cqe |
t7 | | user app handles cqe
t8 | new event happen |
t9 | add back to waitqueue |
t10 |

After t6 but before t9, if new event happens, there'll be no wakeup
operation, and if user app has picked up this cqe in t7, read or write
until EAGAIN is returned. In t8, new event happens and will be lost,
though this race window maybe small.

To fix this possible race, add poll request back to waitqueue before
committing cqe.

Fixes: 88e41cf928a6 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210903142436.5767-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d4ad41e 01-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: prolong tctx_task_work() with flushing

io_submit_flush_completions() may enqueue linked requests for task_work
execution, so don't leave tctx_task_work() right after the tw list is
exhausted, but try to flush and then retry.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

63637853 01-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't disable kiocb_done() CQE batching

Not passing issue_flags from kiocb_done() into __io_complete_rw() means
that completion batching for this case is disabled, e.g. for most of
buffered reads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa84693b 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL

SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.

Link: https://github.com/axboe/liburing/issues/420
Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05a444d3 29-Aug-2021 Colin Ian King <colin.king@canonical.com>

ceph: fix dereference of null pointer cf

Currently in the case where kmem_cache_alloc fails the null pointer
cf is dereferenced when assigning cf->is_capsnap = false. Fix this
by adding a null pointer check and return path.

Cc: stable@vger.kernel.org
Addresses-Coverity: ("Dereference null return")
Fixes: b2f9fa1f3bd8 ("ceph: correctly handle releasing an embedded cap flush")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a9c9a6f7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, qla2xxx,
target, smartpqi, lpfc, mpt3sas).

The core change causing the most churn was replacing the command
request field request with a macro, allowing us to offset map to it
and remove the redundant field; the same was also done for the tag
field.

The most impactful change is the final removal of scsi_ioctl, which
has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
scsi: ufs: ufs-exynos: Fix static checker warning
scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Copyright updates for 14.0.0.1 patches
scsi: lpfc: Update lpfc version to 14.0.0.1
scsi: lpfc: Add bsg support for retrieving adapter cmf data
scsi: lpfc: Add cmf_info sysfs entry
scsi: lpfc: Add debugfs support for cm framework buffers
scsi: lpfc: Add support for maintaining the cm statistics buffer
scsi: lpfc: Add rx monitoring statistics
scsi: lpfc: Add support for the CM framework
scsi: lpfc: Add cmfsync WQE support
scsi: lpfc: Add support for cm enablement buffer
scsi: lpfc: Add cm statistics buffer support
scsi: lpfc: Add EDC ELS support
scsi: lpfc: Expand FPIN and RDF receive logging
scsi: lpfc: Add MIB feature enablement support
scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
scsi: fc: Add EDC ELS definition
...


9f358999 26-Jan-2021 Jeff Layton <jlayton@kernel.org>

ceph: drop the mdsc_get_session/put_session dout messages

These are very chatty, racy, and not terribly useful. Just remove them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

3eaf5aa1 02-Sep-2021 Jeff Layton <jlayton@kernel.org>

ceph: lockdep annotations for try_nonblocking_invalidate

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a76d0a9c 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: don't WARN if we're forcibly removing the session caps

For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

42ad631b 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: don't WARN if we're force umounting

Force umount will try to close the sessions by setting the session
state to _CLOSING. We don't want to WARN in this situation, since it's
expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a6d37ccd 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: remove the capsnaps when removing caps

capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b11ed503 11-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: request Fw caps before updating the mtime in ceph_write_iter

The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.

This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.

Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.

URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d517b398 17-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: reconnect to the export targets on new mdsmaps

In the case where the export MDS has crashed just after the EImportStart
journal is flushed, a standby MDS takes over for it and when replaying
the EImportStart journal the MDS will wait the client to reconnect. That
may never happen because the client may not have registered or opened
the sessions yet.

When receiving a new map, ensure we reconnect to valid export targets as
well if their sessions don't exist yet.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

692e1715 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: print more information when we can't find snaprealm

Print a bit more information when we can't find the realm during
ceph_add_cap. Show both the inode number and the old realm inode
number.

Suggested-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0ba92e1c 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: add ceph_change_snap_realm() helper

Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c80dc3ae 27-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: remove redundant initializations from mdsc and session

The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so
there's no need to explicitly initialize either of their fields to 0.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b4002173 27-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: cancel delayed work instead of flushing on mdsc teardown

The first thing metric_delayed_work does is check mdsc->stopping,
and then return immediately if it's set. That's good since we would
have already torn down the metric structures at this point, otherwise,
but there is no locking around mdsc->stopping.

It's possible that the ceph_metric_destroy call could race with the
delayed_work, in which case we could end up with the delayed_work
accessing destroyed percpu variables.

At this point in the mdsc teardown, the "stopping" flag has already been
set, so there's no benefit to flushing the work. Move the work
cancellation in ceph_metric_destroy ahead of the percpu variable
destruction, and eliminate the flush_delayed_work call in
ceph_mdsc_destroy.

Fixes: 18f473b384a6 ("ceph: periodically send perf metrics to MDSes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

40e309de 26-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: add a new vxattr to return auth mds for an inode

Add a new vxattr that shows what MDS is authoritative for an inode (if
we happen to have auth caps). If we don't have an auth cap for the inode
then just return -1.

URL: https://tracker.ceph.com/issues/1276
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

49f8899e 06-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: remove some defunct forward declarations

We missed these in the recent fscache rework.

Reported-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

e1a4541e 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: flush the mdlog before waiting on unsafe reqs

For the client requests who will have unsafe and safe replies from
MDS daemons, in the MDS side the MDS daemons won't flush the mdlog
(journal log) immediatelly, because they think it's unnecessary.
That's true for most cases but not all, likes the fsync request.
The fsync will wait until all the unsafe replied requests to be
safely replied.

Normally if there have multiple threads or clients are running, the
whole mdlog in MDS daemons could be flushed in time if any request
will trigger the mdlog submit thread. So usually we won't experience
the normal operations will stuck for a long time. But in case there
has only one client with only thread is running, the stuck phenomenon
maybe obvious and the worst case it must wait at most 5 seconds to
wait the mdlog to be flushed by the MDS's tick thread periodically.

This patch will trigger to flush the mdlog in the relevant and auth
MDSes to which the in-flight requests are sent just before waiting
the unsafe requests to finish.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d095559c 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: flush mdlog before umounting

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

59b312f3 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: make iterate_sessions a global symbol

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

fba97e80 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: make ceph_create_session_msg a global symbol

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ce3a8732 14-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix comment about short copies in ceph_write_end

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2ad32cf0 01-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix memory leak on decode error in ceph_handle_caps

If we hit a decoding error late in the frame, then we might exit the
function without putting the pool_ns string. Ensure that we always put
that reference on the way out of the function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c815f04b 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit updates from Shuah Khan:
"This KUnit update for Linux 5.15-rc1 adds new features and tests:

Tool:

- support for '--kernel_args' to allow setting module params

- support for '--raw_output' option to show just the kunit output
during make

Tests:

- new KUnit tests for checksums and timestamps

- Print test statistics on failure

- Integrates UBSAN into the KUnit testing framework. It fails KUnit
tests whenever it reports undefined behavior"

* tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Print test statistics on failure
kunit: tool: make --raw_output support only showing kunit output
kunit: tool: add --kernel_args to allow setting module params
kunit: ubsan integration
fat: Add KUnit tests for checksums and timestamps


eceae1e7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:

- fix a race in configfs_lookup (Sishuai Gong)

- minor cleanups (me)

* tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs:
configfs: fix a race in configfs_lookup()
configfs: fold configfs_attach_attr into configfs_lookup
configfs: simplify the configfs_dirent_is_ready
configfs: return -ENAMETOOLONG earlier in configfs_lookup


265113f7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
"This set includes a number of minor fixes and cleanups related to the
networking changes in the last release.

A patch to delay ack messages reduces network traffic significantly"

* tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: avoid comms shutdown delay in release_lockspace
fs: dlm: fix return -EINTR on recovery stopped
fs: dlm: implement delayed ack handling
fs: dlm: move receive loop into receive handler
fs: dlm: fix multiple empty writequeue alloc
fs: dlm: generic connect func
fs: dlm: auto load sctp module
fs: dlm: introduce generic listen
fs: dlm: move to static proto ops
fs: dlm: introduce con_next_wq helper
fs: dlm: cleanup and remove _send_rcom
fs: dlm: clear CF_APP_LIMITED on close
fs: dlm: fix typo in tlv prefix
fs: dlm: use READ_ONCE for config var
fs: dlm: use sk->sk_socket instead of con->sock


3146cba9 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: make worker creation resilient against signals

If a task is queueing async work and also handling signals, then we can
run into the case where create_io_thread() is interrupted and returns
failure because of that. If this happens for creating the first worker
in a group, then that worker will never get created and we can hang the
ring.

If we do get a fork failure, retry from task_work. With signals we have
to be a bit careful as we cannot simply queue as task_work, as we'll
still have signals pending at that point. Punt over a normal workqueue
first and then create from task_work after that.

Lastly, ensure that we handle fatal worker creations. Worker creation
failures are normally not fatal, only if we fail to create one in an empty
worker group can we not make progress. Right now that is ignored, ensure
that we handle that and run cancel on the work item.

There are two paths that create new workers - one is the "existing worker
going to sleep", and the other is "no workers found for this work, create
one". The former is never fatal, as workers do exist in the group. Only
the latter needs to be carefully handled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

05c5f4ee 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: get rid of FIXED worker flag

It makes the logic easier to follow if we just get rid of the fixed worker
flag, and simply ensure that we never exit the last worker in the group.
This also means that no particular worker is special.

Just track the last timeout state, and if we have hit it and no work
is pending, check if there are other workers. If yes, then we can exit
this one safely.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

b0cfcdd9 16-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

d_path: make 'prepend()' fill up the buffer exactly on overflow

Instead of just marking the buffer as having overflowed, fill it up as
much as we can. That will allow the overflow case to then return
whatever truncated result if it wants to.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

111c1aa8 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"In addition to some ext4 bug fixes and cleanups, this cycle we add the
orphan_file feature, which eliminates bottlenecks when doing a large
number of parallel truncates and file deletions, and move the discard
operation out of the jbd2 commit thread when using the discard mount
option, to better support devices with slow discard operations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: make the updating inode data procedure atomic
ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
ext4: move inode eio simulation behind io completeion
ext4: Improve scalability of ext4 orphan file handling
ext4: Orphan file documentation
ext4: Speedup ext4 orphan inode handling
ext4: Move orphan inode handling into a separate file
ext4: Support for checksumming from journal triggers
ext4: fix race writing to an inline_data file while its xattrs are changing
jbd2: add sparse annotations for add_transaction_credits()
ext4: fix sparse warnings
ext4: Make sure quota files are not grabbed accidentally
ext4: fix e2fsprogs checksum failure for mounted filesystem
ext4: if zeroout fails fall back to splitting the extent node
ext4: reduce arguments of ext4_fc_add_dentry_tlv
ext4: flush background discard kwork when retry allocation
ext4: get discard out of jbd2 commit kthread contex
ext4: remove the repeated comment of ext4_trim_all_free
ext4: add new helper interface ext4_try_to_trim_range()
ext4: remove the 'group' parameter of ext4_trim_extent
...


815409a1 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

- Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

- Improve performance by enabling RCU lookup.

- Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument. The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: enable RCU'd ->get_acl()
vfs: add rcu argument to ->get_acl() callback
ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
ovl: use kvalloc in xattr copy-up
ovl: update ctime when changing fileattr
ovl: skip checking lower file's i_writecount on truncate
ovl: relax lookup error on mismatch origin ftype
ovl: do not set overlay.opaque for new directories
ovl: add ovl_allow_offline_changes() helper
ovl: disable decoding null uuid with redirect_dir
ovl: consistent behavior for immutable/append-only inodes
ovl: copy up sync/noatime fileattr flags
ovl: pass ovl_fs to ovl_check_setxattr()
fs: add generic helper for filling statx attribute flags


412106c2 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
"In this cycle, direct I/O and fsdax support for uncompressed files are
now added in order to avoid double-caching for loop device and VM
container use cases. All uncompressed cases are now turned into iomap
infrastructure, which looks much simpler and cleaner.

In addition, fiemap support is added for both (un)compressed files by
using iomap infrastructure as well so end users can easily get file
distribution. We've also added chunk-based uncompressed files support
for data deduplication as the next step of VM container use cases.

Summary:

- support direct I/O for all uncompressed files

- support fsdax for non-tailpacking regular files

- use iomap infrastructure for all uncompressed cases

- support fiemap for both (un)compressed files

- introduce chunk-based files for chunk deduplication

- some cleanups"

* tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix double free of 'copied'
erofs: support reading chunk-based uncompressed files
erofs: introduce chunk-based file on-disk format
erofs: add fiemap support with iomap
erofs: add support for the full decompressed length
erofs: remove the mapping parameter from erofs_try_to_free_cached_page()
erofs: directly use wrapper erofs_page_is_managed() when shrinking
erofs: convert all uncompressed cases to iomap
erofs: dax support for non-tailpacking regular file
erofs: iomap support for non-tailpacking DIO


89594c74 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:
"Preparatory work for the fscache rewrite that's being worked on and
fix some bugs. These include:

- Always select netfs stats when enabling fscache stats since they're
displayed through the same procfile.

- Add a cookie debug ID that can be used in tracepoints instead of a
pointer and cache it in the netfs_cache_resources struct rather
than in the netfs_read_request struct to make it more available.

- Use file_inode() in cachefiles rather than dereferencing
file->f_inode directly.

- Provide a procfile to display fscache cookies.

- Remove the fscache and cachefiles histogram procfiles.

- Remove the fscache object list procfile.

- Avoid using %p in fscache and cachefiles as the value is hashed and
not comparable to the register dump in an oops trace.

- Fix the cookie hash function to actually achieve useful dispersion.

- Fix fscache_cookie_put() so that it doesn't dereference the cookie
pointer in the tracepoint after the refcount has been decremented
(we're only allowed to do that if we decremented it to zero).

- Use refcount_t rather than atomic_t for the fscache_cookie
refcount"

* tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Use refcount_t for the cookie refcount instead of atomic_t
fscache: Fix fscache_cookie_put() to not deref after dec
fscache: Fix cookie key hashing
cachefiles: Change %p in format strings to something else
fscache: Change %p in format strings to something else
fscache: Remove the object list procfile
fscache, cachefiles: Remove the histogram stuff
fscache: Procfile to display cookies
fscache: Add a cookie debug ID and use that in traces
cachefiles: Use file_inode() rather than accessing ->f_inode
netfs: Move cookie debug ID to struct netfs_cache_resources
fscache: Select netfs stats if fscache stats are enabled


2e3a51b5 29-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change how module init/info messages are displayed

Usually in file system init() messages are only displayed in info level.
Change level from notice to info, but keep CONFIG_NTFS3_64BIT_CLUSTER in
notice level. Also this need even more attention so let's put big
warning here so that nobody will not try accidentally use it.

There is also no good reason to display internal stuff like binary tree
search. This is always on option which can only disabled for debugging
purposes by developer. Also this message does not even check if
developer has disabled it or not so it is useless info.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

989e795b 26-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove GPL boilerplates from decompress lib files

Files already have SDPX identifier so no reason to keep boilerplates in
these files anymore.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

dd854e4b 25-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter

This check will be also performed in generic_file_read_iter() so we do
not want to check this two times in a row.

This was founded with Smatch
fs/ntfs3/file.c:803 ntfs_file_read_iter()
warn: unused return: count = iov_iter_count()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d4e8e135 25-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()

Use fiemap_prep() to check valid flags. It also shrink request scope
(@len) to what the fs can actually handle.

This address following Smatch static checker warning:
fs/ntfs3/frecord.c:1894 ni_fiemap()
warn: potential integer overflow from user 'vbo + len'

Because fiemap_prep() shrinks @len this cannot happened anymore.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: lore.kernel.org/ntfs3/20210825080440.GA17407@kili/

Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation")
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

90c90cda 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
"There's a lot in this cycle.

Starting with bug fixes: To avoid livelocks between the logging code
and the quota code, we've disabled the ability of quotaoff to turn off
quota accounting. (Admins can still disable quota enforcement, but
truly turning off accounting requires a remount.) We've tried to do
this in a careful enough way that there shouldn't be any user visible
effects aside from quotaoff no longer randomly hanging the system.

We've also fixed some bugs in runtime log behavior that could trip up
log recovery if (otherwise unrelated) transactions manage to start and
commit concurrently; some bugs in the GETFSMAP ioctl where we would
incorrectly restrict the range of records output if the two xfs
devices are of different sizes; a bug that resulted in fallocate
funshare failing unnecessarily; and broken behavior in the xfs inode
cache when DONTCACHE is in play.

As for new features: we now batch inode inactivations in percpu
background threads, which sharply decreases frontend thread wait time
when performing file deletions and should improve overall directory
tree deletion times. This eliminates both the problem where closing an
unlinked file (especially on a frozen fs) can stall for a long time,
and should also ease complaints about direct reclaim bogging down on
unlinked file cleanup.

Starting with this release, we've enabled pipelining of the XFS log.
On workloads with high rates of metadata updates to different shards
of the filesystem, multiple threads can be used to format committed
log updates into log checkpoints.

Lastly, with this release, two new features have graduated to
supported status: inode btree counters (for faster mounts), and
support for dates beyond Y2038. Expect these to be enabled by default
in a future release of xfsprogs.

Summary:

- Fix a potential log livelock on busy filesystems when there's so
much work going on that we can't finish a quotaoff before filling
up the log by removing the ability to disable quota accounting.

- Introduce the ability to use per-CPU data structures in XFS so that
we can do a better job of maintaining CPU locality for certain
operations.

- Defer inode inactivation work to per-CPU lists, which will help us
batch that processing. Deletions of large sparse files will
*appear* to run faster, but all that means is that we've moved the
work to the backend.

- Drop the EXPERIMENTAL warnings from the y2038+ support and the
inode btree counters, since it's been nearly a year and no
complaints have come in.

- Remove more of our bespoke kmem* variants in favor of using the
standard Linux calls.

- Prepare for the addition of log incompat features in upcoming
cycles by actually adding code to support this.

- Small cleanups of the xattr code in preparation for landing support
for full logging of extended attribute updates in a future cycle.

- Replace the various log shutdown state and flag code all over xfs
with a single atomic bit flag.

- Fix a serious log recovery bug where log item replay can be skipped
based on the start lsn of a transaction even though the transaction
commit lsn is the key data point for that by enforcing start lsns
to appear in the log in the same order as commit lsns.

- Enable pipelining in the code that pushes log items to disk.

- Drop ->writepage.

- Fix some bugs in GETFSMAP where the last fsmap record reported for
a device could extend beyond the end of the device, and a separate
bug where query keys for one device could be applied to another.

- Don't let GETFSMAP query functions edit their input parameters.

- Small cleanups to the scrub code's handling of perag structures.

- Small cleanups to the incore inode tree walk code.

- Constify btree function parameters that aren't changed, so that
there will never again be confusion about range query functions
changing their input parameters.

- Standardize the format and names of tracepoint data attributes.

- Clean up all the mount state and feature flags to use wrapped
bitset functions instead of inconsistently open-coded flag checks.

- Fix some confusion between xfs_buf hash table key variable vs.
block number.

- Fix a mis-interaction with iomap where we reported shared delalloc
cow fork extents to iomap, which would cause the iomap unshare
operation to return IO errors unnecessarily.

- Fix DONTCACHE behavior"

* tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (103 commits)
xfs: fix I_DONTCACHE
xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
xfs: fix perag structure refcounting error when scrub fails
xfs: rename buffer cache index variable b_bn
xfs: convert bp->b_bn references to xfs_buf_daddr()
xfs: introduce xfs_buf_daddr()
xfs: kill xfs_sb_version_has_v3inode()
xfs: introduce xfs_sb_is_v5 helper
xfs: remove unused xfs_sb_version_has wrappers
xfs: convert xfs_sb_version_has checks to use mount features
xfs: convert scrub to use mount-based feature checks
xfs: open code sb verifier feature checks
xfs: convert xfs_fs_geometry to use mount feature checks
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
xfs: convert remaining mount flags to state flags
xfs: convert mount flags to features
xfs: consolidate mount option features in m_features
xfs: replace xfs_sb_version checks with feature flag checks
xfs: reflect sb features in xfs_mount
xfs: rework attr2 feature and mount options
...


bcfeebbf 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull exit cleanups from Eric Biederman:
"In preparation of doing something about PTRACE_EVENT_EXIT I have
started cleaning up various pieces of code related to do_exit. Most of
that code I did not manage to get tested and reviewed before the merge
window opened but a handful of very useful cleanups are ready to be
merged.

The first change is simply the removal of the bdflush system call. The
code has now been disabled long enough that even the oldest userspace
working userspace setups anyone can find to test are fine with the
bdflush system call being removed.

Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
calling do_exit directly is interesting only in that it is nearly the
most difficult of the incorrect uses of do_exit to remove.

The change to the seccomp code to simply send a signal instead of
calling do_coredump directly is a very nice little cleanup made
possible by realizing the existing signal sending helpers were missing
a little bit of functionality that is easy to provide"

* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal/seccomp: Dump core when there is only one live thread
signal/seccomp: Refactor seccomp signal and coredump generation
signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
exit/bdflush: Remove the deprecated bdflush system call


48983701 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull siginfo si_trapno updates from Eric Biederman:
"The full set of si_trapno changes was not appropriate as a fix for the
newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
related cleanups.

This is the rest of the cleanups for si_trapno that reduces it from
being a really weird arch special case that is expect to be always
present (but isn't) on the architectures that support it to being yet
another field in the _sigfault union of struct siginfo.

The changes have been reviewed and marinated in linux-next. With the
removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
that works across architectures should be easier to write and
maintain"

* 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
signal: Verify the alignment and size of siginfo_t
signal: Remove the generic __ARCH_SI_TRAPNO support
signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
arm64: Add compile-time asserts for siginfo_t offsets
arm: Add compile-time asserts for siginfo_t offsets
sparc64: Add compile-time asserts for siginfo_t offsets


15e20db2 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: only exit on fatal signals

If the application uses io_uring and also relies heavily on signals
for communication, that can cause io-wq workers to spuriously exit
just because the parent has a signal pending. Just ignore signals
unless they are fatal.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

f95dc207 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: split bounded and unbounded work into separate lists

We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48c13d ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecd95673 26-Aug-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: avoid comms shutdown delay in release_lockspace

When dlm_release_lockspace does active shutdown on connections to
other nodes, the active shutdown will wait for any exisitng passive
shutdowns to be resolved. But, the sequence of operations during
dlm_release_lockspace can prevent the normal resolution of passive
shutdowns (processed normally by way of lockspace recovery.)
This disruption of passive shutdown handling can cause the active
shutdown to wait for a full timeout period, delaying the completion
of dlm_release_lockspace.

To fix this, make dlm_release_lockspace resolve existing passive
shutdowns (by calling dlm_clear_members earlier), before it does
active shutdowns. The active shutdowns will not find any passive
shutdowns to wait for, and will not be delayed.

Reported-by: Chris Mackowski <cmackows@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>

c6c3c570 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
"Here is the big set of driver core patches for 5.15-rc1.

These do change a number of different things across different
subsystems, and because of that, there were 2 stable tags created that
might have already come into your tree from different pulls that did
the following

- changed the bus remove callback to return void

- sysfs iomem_get_mapping rework

Other than those two things, there's only a few small things in here:

- kernfs performance improvements for huge numbers of sysfs users at
once

- tiny api cleanups

- other minor changes

All of these have been in linux-next for a while with no reported
problems, other than the before-mentioned merge issue"

* tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (33 commits)
MAINTAINERS: Add dri-devel for component.[hc]
driver core: platform: Remove platform_device_add_properties()
ARM: tegra: paz00: Handle device properties with software node API
bitmap: extend comment to bitmap_print_bitmask/list_to_buf
drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
topology: use bin_attribute to break the size limitation of cpumap ABI
lib: test_bitmap: add bitmap_print_bitmask/list_to_buf test cases
cpumask: introduce cpumap_print_list/bitmask_to_buf to support large bitmask and list
sysfs: Rename struct bin_attribute member to f_mapping
sysfs: Invoke iomem_get_mapping() from the sysfs open callback
debugfs: Return error during {full/open}_proxy_open() on rmmod
zorro: Drop useless (and hardly used) .driver member in struct zorro_dev
zorro: Simplify remove callback
sh: superhyway: Simplify check in remove callback
nubus: Simplify check in remove callback
nubus: Make struct nubus_driver::remove return void
kernfs: dont call d_splice_alias() under kernfs node lock
kernfs: use i_lock to protect concurrent inode updates
kernfs: switch kernfs to use an rwsem
kernfs: use VFS negative dentry caching
...


9605f75c 30-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: should put a page beyond EOF when preparing a write

The prepare_compress_overwrite() gets/locks a page to prepare a read, and calls
f2fs_read_multi_pages() which checks EOF first. If there's any page beyond EOF,
we unlock the page and set cc->rpages[i] = NULL, which we can't put the page
anymore. This makes page leak, so let's fix by putting that page.

Fixes: a949dc5f2c5c ("f2fs: compress: fix race condition of overwrite vs truncate")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

827f0284 30-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: deallocate compressed pages when error happens

In f2fs_write_multi_pages(), f2fs_compress_pages() allocates pages for
compression work in cc->cpages[]. Then, f2fs_write_compressed_pages() initiates
bio submission. But, if there's any error before submitting the IOs like early
f2fs_cp_error(), previously it didn't free cpages by f2fs_compress_free_page().
Let's fix memory leak by putting that just before deallocating cc->cpages.

Fixes: 4c8ff7095bef ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

0242f642 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix queue stalling race

We need to set the stalled bit early, before we drop the lock for adding
us to the stall hash queue. If not, then we can race with new work being
queued between adding us to the stall hash and io_worker_handle_work()
marking us stalled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

927bc120 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull close_range() cleanup from Christian Brauner:
"This is a cleanup for close_range() which was sent as part of a bugfix
we did some time ago in commit 9b5b872215fe ("file: fix close_range()
for unshare+cloexec").

We used to share more code between some helpers for close_range()
which made retrieving the maximum number of open fds before calling
into the helpers sensible. But with the introduction of
CLOSE_RANGE_CLOEXEC and the need to retrieve the number of maximum fds
once more for CLOSE_RANGE_CLOEXEC that stopped making sense. So the
code was in a dumb in-limbo state.

Fix this by simplifying the code a bit.

The original idea was to only fix the bug itself and make backporting
easy. And since the cleanup wasn't very pressing I left it in
linux-next for a very long time. I didn't pull the patches from the
list again back then which is why they don't have lore-links. So I'm
listing them below explicitly"

Commit 03ba0fe4d09f ("file: simplify logic in __close_range()")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-3-brauner@kernel.org

Commit f49fd6d3c070 ("file: let pick_file() tell caller it's done")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-4-brauner@kernel.org

* tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
file: simplify logic in __close_range()
file: let pick_file() tell caller it's done


1dd5915a 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs.move_mount.move_mount_set_group.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull move_mount updates from Christian Brauner:
"This contains an extension to the move_mount() syscall making it
possible to add a single private mount into an existing propagation
tree.

The use-case comes from the criu folks which have been struggling with
restoring complex mount trees for a long time. Variations of this work
have been discussed at Plumbers before, e.g.

https://www.linuxplumbersconf.org/event/7/contributions/640/

The extension to move_mount() enables criu to restore any set of mount
namespaces, mount trees and sharing group trees without introducing
yet more complexity into mount propagation itself.

The changes required to criu to make use of this and restore complex
propagation trees are available at

https://github.com/Snorch/criu/commits/mount-v2-poc

A cleaned-up version of this will go up for merging into the main criu
repo after this lands"

* tag 'fs.move_mount.move_mount_set_group.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest
move_mount: allow to add a mount into an existing group


0ee7c3e2 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"The most notable externally visible change for this cycle is the
addition of support for reads to inline tail fragments of files, which
was requested by the erofs developers; and a correction for a kernel
memory corruption bug if the sysadmin tries to activate a swapfile
with more pages than the swapfile header suggests.

We also now report writeback completion errors to the file mapping
correctly, instead of munging all errors into EIO.

Internally, the bulk of the changes are Christoph's patchset to reduce
the indirect function call count by a third to a half by converting
iomap iteration from a loop pattern to a generator/consumer pattern.
As an added bonus, fsdax no longer open-codes iomap apply loops.

Summary:

- Simplify the bio_end_page usage in the buffered IO code.

- Support reading inline data at nonzero offsets for erofs.

- Fix some typos and bad grammar.

- Convert kmap_atomic usage in the inline data read path.

- Add some extra inline data input checking.

- Fix a memory corruption bug stemming from iomap_swapfile_activate
trying to activate more pages than mm was expecting.

- Pass errnos through the page writeback code so that writeback
errors are reported correctly instead of being munged to EIO.

- Replace iomap_apply with a open-coded iterator loops to reduce the
number of indirect calls by a third to a half.

- Refactor the fsdax code to use iomap iterators instead of the
open-coded iomap_apply code that it had before.

- Format file range iomap tracepoint data in hexadecimal and
standardize the names used in the pretty-print string"

* tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
iomap: standardize tracepoint formatting and storage
mm/swap: consider max pages in iomap_swapfile_add_extent
iomap: move loop control code to iter.c
iomap: constify iomap_iter_srcmap
fsdax: switch the fault handlers to use iomap_iter
fsdax: factor out a dax_fault_actor() helper
fsdax: factor out helpers to simplify the dax fault code
iomap: rework unshare flag
iomap: pass an iomap_iter to various buffered I/O helpers
iomap: remove iomap_apply
fsdax: switch dax_iomap_rw to use iomap_iter
iomap: switch iomap_swapfile_activate to use iomap_iter
iomap: switch iomap_seek_data to use iomap_iter
iomap: switch iomap_seek_hole to use iomap_iter
iomap: switch iomap_bmap to use iomap_iter
iomap: switch iomap_fiemap to use iomap_iter
iomap: switch __iomap_dio_rw to use iomap_iter
iomap: switch iomap_page_mkwrite to use iomap_iter
iomap: switch iomap_zero_range to use iomap_iter
iomap: switch iomap_file_unshare to use iomap_iter
...


916d636e 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'vfs-5.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull project quota update from Darrick Wong:
"A single VFS patch that prevents userspace from setting project quota
ids on files that the VFS considers invalid"

* tag 'vfs-5.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: forbid invalid project ID


8bda9557 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
"New features:

- Support for server-side disconnect injection via debugfs

- Protocol definitions for new RPC_AUTH_TLS authentication flavor

Performance improvements:

- Reduce page allocator traffic in the NFSD splice read actor

- Reduce CPU utilization in svcrdma's Send completion handler

Notable bug fixes:

- Stabilize lockd operation when re-exporting NFS mounts

- Fix the use of %.*s in NFSD tracepoints

- Fix /proc/sys/fs/nfs/nsm_use_hostnames"

* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...


b8ce1b9d 31-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't submit half-prepared drain request

[ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
[ 3784.910926] Call Trace:
[ 3784.910928] ? io_read+0x17c/0x480
[ 3784.910945] io_issue_sqe+0xcb/0x1840
[ 3784.910953] __io_queue_sqe+0x44/0x300
[ 3784.910959] io_req_task_submit+0x27/0x70
[ 3784.910962] tctx_task_work+0xeb/0x1d0
[ 3784.910966] task_work_run+0x61/0xa0
[ 3784.910968] io_run_task_work_sig+0x53/0xa0
[ 3784.910975] __x64_sys_io_uring_enter+0x22/0x30
[ 3784.910977] do_syscall_64+0x3d/0x90
[ 3784.910981] entry_SYSCALL_64_after_hwframe+0x44/0xae

io_drain_req() goes before checks for REQ_F_FAIL, which protect us from
submitting under-prepared request (e.g. failed in io_init_req(). Fail
such drained requests as well.

Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c6d3d9cb 31-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix queueing half-created requests

[ 27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
[ 27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[ 27.263730] RIP: 0010:sock_from_file+0x20/0x90
[ 27.272444] Call Trace:
[ 27.272736] io_sendmsg+0x98/0x600
[ 27.279216] io_issue_sqe+0x498/0x68d0
[ 27.281142] __io_queue_sqe+0xab/0xb50
[ 27.285830] io_req_task_submit+0xbf/0x1b0
[ 27.286306] tctx_task_work+0x178/0xad0
[ 27.288211] task_work_run+0xe2/0x190
[ 27.288571] exit_to_user_mode_prepare+0x1a1/0x1b0
[ 27.289041] syscall_exit_to_user_mode+0x19/0x50
[ 27.289521] do_syscall_64+0x48/0x90
[ 27.289871] entry_SYSCALL_64_after_hwframe+0x44/0xae

io_req_complete_failed() -> io_req_complete_post() ->
io_req_task_queue() still would try to enqueue hard linked request,
which can be half prepared (e.g. failed init), so we can't allow
that to happen.

Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic")
Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08bdbd39 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: ensure that hash wait lock is IRQ disabling

A previous commit removed the IRQ safety of the worker and wqe locks,
but that left one spot of the hash wait lock now being done without
already having IRQs disabled.

Ensure that we use the right locking variant for the hashed waitqueue
lock.

Fixes: a9a4aa9fbfc5 ("io-wq: wqe and worker locks no longer need to be IRQ safe")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7db30437 21-Aug-2021 Ming Lei <ming.lei@redhat.com>

io_uring: retry in case of short read on block device

In case of buffered reading from block device, when short read happens,
we should retry to read more, otherwise the IO will be completed
partially, for example, the following fio expects to read 2MB, but it
can only read 1M or less bytes:

fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting

Fix the issue by allowing short read retry for block device, which sets
FMODE_BUF_RASYNC really.

Fixes: 9a173346bd9e ("io_uring: fix short read retries for non-reg files")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b3188e7 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: IORING_OP_WRITE needs hash_reg_file set

During some testing, it became evident that using IORING_OP_WRITE doesn't
hash buffered writes like the other writes commands do. That's simply
an oversight, and can cause performance regressions when doing buffered
writes with this command.

Correct that and add the flag, so that buffered writes are correctly
hashed when using the non-iovec based write command.

Cc: stable@vger.kernel.org
Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94ffb0a2 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix race between adding work and activating a free worker

The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.

Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and elimiate the race
with the worker itself checking inside the lock, but sleeping outside of
it.

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4529fb15 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'gfs2-v5.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

- Various withdraw related fixes (freeze glock recursion, thread
initialization / destruction order, journal recovery, glock cleanup,
withdraw under journal lock).

- Some error message improvements.

- Various minor cleanups.

* tag 'gfs2-v5.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Remove redundant check from gfs2_glock_dq
gfs2: Delay withdraw from atomic context
gfs2: Don't call dlm after protocol is unmounted
gfs2: don't stop reads while withdraw in progress
gfs2: Mark journal inodes as "don't cache"
gfs2: nit: gfs2_drop_inode shouldn't return bool
gfs2: Eliminate vestigial HIF_FIRST
gfs2: Make recovery error more readable
gfs2: Don't release and reacquire local statfs bh
gfs2: init system threads before freeze lock
gfs2: tiny cleanup in gfs2_log_reserve
gfs2: trivial clean up of gfs2_ail_error
gfs2: be more verbose replaying invalid rgrp blocks
gfs2: Fix glock recursion in freeze_go_xmote_bh
gfs2: Fix memory leak of object lsi on error return path


cd358208 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
"Some small fixes and cleanups for fs/crypto/:

- Fix ->getattr() for ext4, f2fs, and ubifs to report the correct
st_size for encrypted symlinks

- Use base64url instead of a custom Base64 variant

- Document struct fscrypt_operations"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: document struct fscrypt_operations
fscrypt: align Base64 encoding with RFC 4648 base64url
fscrypt: remove mention of symlink st_size quirk from documentation
ubifs: report correct st_size for encrypted symlinks
f2fs: report correct st_size for encrypted symlinks
ext4: report correct st_size for encrypted symlinks
fscrypt: add fscrypt_symlink_getattr() for computing st_size


87045e65 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.

There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.

Features:

- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity

- idmapped mount support

- make mount with rescue=ibadroots more tolerant to partially damaged
trees

- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles

- zoned mode block group auto reclaim can be disabled via sysfs knob

Performance improvements:

- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)

- batching of delayed items can speed up creating many files

- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)

Fixes:

- various zoned mode fixes

- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems

Core:

- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount

- improved flushing logic, fixing early ENOSPC on some workloads

- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity

- new tree items for fs-verity
- descriptor item
- Merkle tree item

- inode operations extended to be namespace-aware

- cleanups and refactoring

Generic code changes:

- fs: new export filemap_fdatawrite_wbc

- fs: removed sync_inode

- block: bio_trim argument type fixups

- vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...


9c849ce8 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client updates from Steve French:
"Eleven cifs/smb3 client fixes:

- mostly restructuring to allow disabling less secure algorithms
(this will allow eventual removing rc4 and md4 from general use in
the kernel)

- four fixes, including two for stable

- enable r/w support with fscache and cifs.ko

I am working on a larger set of changes (the usual ... multichannel,
auth and signing improvements), but wanted to get these in earlier to
reduce chance of merge conflicts later in the merge window"

* tag '5.15-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED
cifs: add cifs_common directory to MAINTAINERS file
cifs: cifs_md4 convert to SPDX identifier
cifs: create a MD4 module and switch cifs.ko to use it
cifs: fork arc4 and create a separate module for it for cifs and other users
cifs: remove support for NTLM and weaker authentication algorithms
cifs: enable fscache usage even for files opened as rw
oid_registry: Add OIDs for missing Spnego auth mechanisms to Macs
smb3: fix posix extensions mount option
cifs: fix wrong release in sess_alloc_buffer() failed path
CIFS: Fix a potencially linear read overflow


e24c567b 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-first-ksmbd-merge' of git://git.samba.org/ksmbd

Pull initial ksmbd implementation from Steve French:
"Initial merge of kernel smb3 file server, ksmbd.

The SMB family of protocols is the most widely deployed network
filesystem protocol, the default on Windows and Macs (and even on many
phones and tablets), with clients and servers on all major operating
systems, but lacked a kernel server for Linux. For many cases the
current userspace server choices were suboptimal either due to memory
footprint, performance or difficulty integrating well with advanced
Linux features.

ksmbd is a new kernel module which implements the server-side of the
SMB3 protocol. The target is to provide optimized performance, GPLv2
SMB server, and better lease handling (distributed caching). The
bigger goal is to add new features more rapidly (e.g. RDMA aka
"smbdirect", and recent encryption and signing improvements to the
protocol) which are easier to develop on a smaller, more tightly
optimized kernel server than for example in Samba.

The Samba project is much broader in scope (tools, security services,
LDAP, Active Directory Domain Controller, and a cross platform file
server for a wider variety of purposes) but the user space file server
portion of Samba has proved hard to optimize for some Linux workloads,
including for smaller devices.

This is not meant to replace Samba, but rather be an extension to
allow better optimizing for Linux, and will continue to integrate well
with Samba user space tools and libraries where appropriate. Working
with the Samba team we have already made sure that the configuration
files and xattrs are in a compatible format between the kernel and
user space server.

Various types of functional and regression tests are regularly run
against it. One example is the automated 'buildbot' regression tests
which use the Linux client to test against ksmbd, e.g.

http://smb3-test-rhel-75.southcentralus.cloudapp.azure.com/#/builders/8/builds/56

but other test suites, including Samba's smbtorture functional test
suite are also used regularly"

* tag '5.15-rc-first-ksmbd-merge' of git://git.samba.org/ksmbd: (219 commits)
ksmbd: fix __write_overflow warning in ndr_read_string
MAINTAINERS: ksmbd: add cifs_common directory to ksmbd entry
MAINTAINERS: ksmbd: update my email address
ksmbd: fix permission check issue on chown and chmod
ksmbd: don't set FILE DELETE and FILE_DELETE_CHILD in access mask by default
MAINTAINERS: add git adddress of ksmbd
ksmbd: update SMB3 multi-channel support in ksmbd.rst
ksmbd: smbd: fix kernel oops during server shutdown
ksmbd: remove select FS_POSIX_ACL in Kconfig
ksmbd: use proper errno instead of -1 in smb2_get_ksmbd_tcon()
ksmbd: update the comment for smb2_get_ksmbd_tcon()
ksmbd: change int data type to boolean
ksmbd: Fix multi-protocol negotiation
ksmbd: fix an oops in error handling in smb2_open()
ksmbd: add ipv6_addr_v4mapped check to know if connection from client is ipv4
ksmbd: fix missing error code in smb2_lock
ksmbd: use channel signingkey for binding SMB2 session setup
ksmbd: don't set RSS capable in FSCTL_QUERY_NETWORK_INTERFACE_INFO
ksmbd: Return STATUS_OBJECT_PATH_NOT_FOUND if smb2_creat() returns ENOENT
ksmbd: fix -Wstringop-truncation warnings
...


d3624466 31-Aug-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Restyle comments to better align with kernel-doc

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

78ab59fe 31-Aug-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Rework file operations

Rename now works "Add new name and remove old name".
"Remove old name and add new name" may result in bad inode
if we can't add new name and then can't restore (add) old name.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

a97131c2 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now

For some reason we have FAT ioctl calls. Even old ntfs driver did not
use these. We should not use these because it his hard to get things out
of kernel when they are upstream. That's why we remove these for now.

More discussion is needed what ioctl should be implemented and what is
important.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

59bda8ec 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: flush extending writes

Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.

If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.

There are at least two cases where this can cause problems:

- copy_file_range() will return short count if the file is not extended
up to end of the source range.

- FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
hence the region may not be fully allocated.

Fix by flushing writes from the start of the range up to the end of the
file. This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.

Fixes: a2bc92362941 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb56b17c ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org> # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

baaae979 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: make the updating inode data procedure atomic

Now that ext4_do_update_inode() return error before filling the whole
inode data if we fail to set inode blocks in ext4_inode_blocks_set().
This error should never happen in theory since sb->s_maxbytes should not
have allowed this, we have already init sb->s_maxbytes according to this
feature in ext4_fill_super(). So even through that could only happen due
to the filesystem corruption, we'd better to return after we finish
updating the inode because it may left an uninitialized buffer and we
could read this buffer later in "errors=continue" mode.

This patch make the updating inode data procedure atomic, call
EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad
happened, make sure that the inode is integrated, and also drop a BUG_ON
and do some small cleanups.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

8e33fadf 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: remove an unnecessary if statement in __ext4_get_inode_loc()

The "if (!buffer_uptodate(bh))" hunk covered almost the whole code after
getting buffer in __ext4_get_inode_loc() which seems unnecessary, remove
it and switch to check ext4_buffer_uptodate(), it simplify code and make
it more readable.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

0904c9ae 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: move inode eio simulation behind io completeion

No EIO simulation is required if the buffer is uptodate, so move the
simulation behind read bio completeion just like inode/block bitmap
simulation does.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

4a79a98c 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Improve scalability of ext4 orphan file handling

Even though the length of the critical section when adding / removing
orphaned inodes was significantly reduced by using orphan file, the
contention of lock protecting orphan file still appears high in profiles
for truncate / unlink intensive workloads with high number of threads.

This patch makes handling of orphan file completely lockless. Also to
reduce conflicts between CPUs different CPUs start searching for empty
slot in orphan file in different blocks.

Performance comparison of locked orphan file handling, lockless orphan
file handling, and completely disabled orphan inode handling
from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on
SAS SSD disk, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time Time Time
Orphan locked Orphan lockless No orphan
1 0.945600 0.939400 0.891200
2 1.331800 1.246600 1.174400
4 1.995000 1.780600 1.713200
8 6.424200 4.900000 4.106000
16 14.937600 8.516400 8.138000
32 33.038200 24.565600 24.002200
64 60.823600 39.844600 38.440200
128 122.941400 70.950400 69.315000

So we can see that with lockless orphan file handling, addition /
deletion of orphaned inodes got almost completely out of picture even
for a microbenchmark stressing it.

For reaim creat_clo workload on ramdisk there are also noticeable gains
(average of 5 runs):

Clients Vanilla (ops/s) Patched (ops/s)
creat_clo-1 14705.88 ( 0.00%) 14354.07 * -2.39%*
creat_clo-3 27108.43 ( 0.00%) 28301.89 ( 4.40%)
creat_clo-5 37406.48 ( 0.00%) 45180.73 * 20.78%*
creat_clo-7 41338.58 ( 0.00%) 54687.50 * 32.29%*
creat_clo-9 45226.13 ( 0.00%) 62937.07 * 39.16%*
creat_clo-11 44000.00 ( 0.00%) 65088.76 * 47.93%*
creat_clo-13 36516.85 ( 0.00%) 68661.97 * 88.03%*
creat_clo-15 30864.20 ( 0.00%) 69551.78 * 125.35%*
creat_clo-17 27478.45 ( 0.00%) 67729.08 * 146.48%*
creat_clo-19 25000.00 ( 0.00%) 61621.62 * 146.49%*
creat_clo-21 18772.35 ( 0.00%) 63829.79 * 240.02%*
creat_clo-23 16698.94 ( 0.00%) 61938.96 * 270.92%*
creat_clo-25 14973.05 ( 0.00%) 56947.61 * 280.33%*
creat_clo-27 16436.69 ( 0.00%) 65008.03 * 295.51%*
creat_clo-29 13949.01 ( 0.00%) 69047.62 * 395.00%*
creat_clo-31 14283.52 ( 0.00%) 67982.45 * 375.95%*

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

02f310fc 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Speedup ext4 orphan inode handling

Ext4 orphan inode handling is a bottleneck for workloads which heavily
truncate / unlink small files since it contends on the global
s_orphan_mutex lock (and generally it's difficult to improve scalability
of the ondisk linked list of orphaned inodes).

This patch implements new way of handling orphan inodes. Instead of
linking orphaned inode into a linked list, we store it's inode number in
a new special file which we call "orphan file". Only if there's no more
space in the orphan file (too many inodes are currently orphaned) we
fall back to using old style linked list. Currently we protect
operations in the orphan file with a spinlock for simplicity but even in
this setting we can substantially reduce the length of the critical
section and thus speedup some workloads. In the next patch we improve
this by making orphan handling lockless.

Note that the change is backwards compatible when the filesystem is
clean - the existence of the orphan file is a compat feature, we set
another ro-compat feature indicating orphan file needs scanning for
orphaned inodes when mounting filesystem read-write. This ro-compat
feature gets cleared on unmount / remount read-only.

Some performance data from 80 CPU Xeon Server with 512 GB of RAM,
filesystem located on SSD, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time Time
Vanilla Patched
1 1.057200 0.945600
2 1.680400 1.331800
4 2.547000 1.995000
8 7.049400 6.424200
16 14.827800 14.937600
32 40.948200 33.038200
64 87.787400 60.823600
128 206.504000 122.941400

So we can see significant wins all over the board.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

25c6d98f 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Move orphan inode handling into a separate file

Move functions for handling orphan inodes into a new file
fs/ext4/orphan.c to have them in one place and somewhat reduce size of
other files. No code changes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

a54c4613 20-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix race writing to an inline_data file while its xattrs are changing

The location of the system.data extended attribute can change whenever
xattr_sem is not taken. So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().

This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.

Cc: stable@kernel.org
Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data")
Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

b33d9f59 14-Aug-2021 Theodore Ts'o <tytso@mit.edu>

jbd2: add sparse annotations for add_transaction_credits()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

a5fda113 14-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix sparse warnings

Add sparse annotations to suppress false positive context imbalance
warnings, and use NULL instead of 0 in EXT_MAX_{EXTENT,INDEX}.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

bd2c38cf 12-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Make sure quota files are not grabbed accidentally

If ext4 filesystem is corrupted so that quota files are linked from
directory hirerarchy, bad things can happen. E.g. quota files can get
corrupted or deleted. Make sure we are not grabbing quota file inodes
when we expect normal inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210812133122.26360-1-jack@suse.cz

b2bbb92f 12-Aug-2021 Jan Kara <jack@suse.cz>

ext4: fix e2fsprogs checksum failure for mounted filesystem

Commit 81414b4dd48 ("ext4: remove redundant sb checksum
recomputation") removed checksum recalculation after updating
superblock free space / inode counters in ext4_fill_super() based on
the fact that we will recalculate the checksum on superblock
writeout.

That is correct assumption but until the writeout happens (which can
take a long time) the checksum is incorrect in the buffer cache and if
programs such as tune2fs or resize2fs is called shortly after a file
system is mounted can fail. So return back the checksum recalculation
and add a comment explaining why.

Fixes: 81414b4dd48f ("ext4: remove redundant sb checksum recomputation")
Cc: stable@kernel.org
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210812124737.21981-1-jack@suse.cz

308c57cc 13-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: if zeroout fails fall back to splitting the extent node

If the underlying storage device is using thin-provisioning, it's
possible for a zeroout operation to return ENOSPC.

Commit df22291ff0fd ("ext4: Retry block allocation if we have free blocks
left") added logic to retry block allocation since we might get free block
after we commit a transaction. But the ENOSPC from thin-provisioning
will confuse ext4, and lead to an infinite loop.

Since using zeroout instead of splitting the extent node is an
optimization, if it fails, we might as well fall back to splitting the
extent node.

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

facec450 27-Jul-2021 Guoqing Jiang <jiangguoqing@kylinos.cn>

ext4: reduce arguments of ext4_fc_add_dentry_tlv

Let's pass fc_dentry directly since those arguments (tag, parent_ino and
ino etc) can be deferenced from it.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210727080708.3708814-1-guoqing.jiang@linux.dev

5036ab8d 30-Aug-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: flush background discard kwork when retry allocation

The background discard kwork tries to mark blocks used and issue
discard. This can make filesystem suffer from NOSPC error, xfstest
generic/371 can fail due to it. Fix it by flushing discard kwork
in ext4_should_retry_alloc. At the same time, give up discard at
the moment.

Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

55cdd0af 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: get discard out of jbd2 commit kthread contex

Right now, discard is issued and waited to be completed in jbd2
commit kthread context after the logs are committed. When large
amount of files are deleted and discard is flooding, jbd2 commit
kthread can be blocked for long time. Then all of the metadata
operations can be blocked to wait the log space.

One case is the page fault path with read mm->mmap_sem held, which
wants to update the file time but has to wait for the log space.
When other threads in the task wants to do mmap, then write mmap_sem
is blocked. Finally all of the following read mmap_sem requirements
are blocked, even the ps command which need to read the /proc/pid/
-cmdline. Our monitor service which needs to read /proc/pid/cmdline
used to be blocked for 5 mins.

This patch frees the blocks back to buddy after commit and then do
discard in a async kworker context in fstrim fashion, namely,
- mark blocks to be discarded as used if they have not been allocated
- do discard
- mark them free
After this, jbd2 commit kthread won't be blocked any more by discard
and we won't get NOSPC even if the discard is slow or throttled.

Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

b91db6a0 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"

* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()


3b629f8d 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block

Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.

It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio


c547d89a 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

- cancellation cleanups (Hao, Pavel)

- io-wq accounting cleanup (Hao)

- io_uring submit locking fix (Hao)

- io_uring link handling fixes (Hao)

- fixed file improvements (wangyangbo, Pavel)

- allow updates of linked timeouts like regular timeouts (Pavel)

- IOPOLL fix (Pavel)

- remove batched file get optimization (Pavel)

- improve reference handling (Pavel)

- IRQ task_work batching (Pavel)

- allow pure fixed file, and add support for open/accept (Pavel)

- GFP_ATOMIC RT kernel fix

- multiple CQ ring waiter improvement

- funnel IRQ completions through task_work

- add support for limiting async workers explicitly

- add different clocksource support for timeouts

- io-wq wakeup race fix

- lots of cleanups and improvement (Pavel et al)

* tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits)
io-wq: fix wakeup race when adding new work
io-wq: wqe and worker locks no longer need to be IRQ safe
io-wq: check max_worker limits if a worker transitions bound state
io_uring: allow updating linked timeouts
io_uring: keep ltimeouts in a list
io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
io-wq: provide a way to limit max number of workers
io_uring: add build check for buf_index overflows
io_uring: clarify io_req_task_cancel() locking
io_uring: add task-refs-get helper
io_uring: fix failed linkchain code logic
io_uring: remove redundant req_set_fail()
io_uring: don't free request to slab
io_uring: accept directly into fixed file table
io_uring: hand code io_accept() fd installing
io_uring: openat directly into fixed fd table
net: add accept helper not installing fd
io_uring: fix io_try_cancel_userdata race for iowq
io_uring: IRQ rw completion batching
io_uring: batch task work locking
...


67936911 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Nothing major in here - lots of good cleanups and tech debt handling,
which is also evident in the diffstats. In particular:

- Add disk sequence numbers (Matteo)

- Discard merge fix (Ming)

- Relax disk zoned reporting restrictions (Niklas)

- Bio error handling zoned leak fix (Pavel)

- Start of proper add_disk() error handling (Luis, Christoph)

- blk crypto fix (Eric)

- Non-standard GPT location support (Dmitry)

- IO priority improvements and cleanups (Damien)o

- blk-throtl improvements (Chunguang)

- diskstats_show() stack reduction (Abd-Alrhman)

- Loop scheduler selection (Bart)

- Switch block layer to use kmap_local_page() (Christoph)

- Remove obsolete disk_name helper (Christoph)

- block_device refcounting improvements (Christoph)

- Ensure gendisk always has a request queue reference (Christoph)

- Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
sg: pass the device name to blk_trace_setup
block, bfq: cleanup the repeated declaration
blk-crypto: fix check for too-large dun_bytes
blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
block: mark blkdev_fsync static
block: refine the disk_live check in del_gendisk
mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
mmc: block: Support alternative_gpt_sector() operation
partitions/efi: Support non-standard GPT location
block: Add alternative_gpt_sector() operation
bio: fix page leak bio_add_hw_page failure
block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
block: remove a pointless call to MINOR() in device_add_disk
null_blk: add error handling support for add_disk()
virtio_blk: add error handling support for add_disk()
block: add error handling for device_add_disk / add_disk
block: return errors from disk_alloc_events
block: return errors from blk_integrity_add
block: call blk_register_queue earlier in device_add_disk
...


8596e589 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and related drivers:

Core code:

- Cure a couple of correctness issues in the posix CPU timer code to
prevent that the tick dependency for NOHZ full is kept alive for no
reason.

- Avoid expensive double reprogramming of the clockevent device in
hrtimer_start_range_ns().

- Avoid pointless SMP function calls when the clock was set to avoid
disturbing CPUs which do not have any affected timers queued.

- Make the clocksource watchdog test work correctly when CONFIG_HZ is
less than 100.

Drivers:

- Prefer the ARM architected timer over the Exynos timer which is way
more expensive to access.

- Add device tree bindings for new Ingenic SoCs

- The usual improvements and cleanups all over the place"

* tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
clocksource: Make clocksource watchdog test safe for slow-HZ systems
dt-bindings: timer: Add ABIs for new Ingenic SoCs
clocksource/drivers/fttmr010: Pass around less pointers
clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown
clocksource/drivers/ingenic: Use bitfield macro helpers
clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
dt-bindings: timer: convert rockchip,rk-timer.txt to YAML
clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU
clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64
hrtimer: Unbreak hrtimer_force_reprogram()
hrtimer: Use raw_cpu_ptr() in clock_was_set()
hrtimer: Avoid more SMP function calls in clock_was_set()
hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
hrtimer: Add bases argument to clock_was_set()
time/timekeeping: Avoid invoking clock_was_set() twice
timekeeping: Distangle resume and clock-was-set events
timerfd: Provide timerfd_resume()
hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
hrtimer: Ensure timerfd notification for HIGHRES=n
hrtimer: Consolidate reprogramming code
...


5d3c0db4 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

- The biggest change in this cycle is scheduler support for asymmetric
scheduling affinity, to support the execution of legacy 32-bit tasks
on AArch32 systems that also have 64-bit-only CPUs.

Architectures can fill in this functionality by defining their own
task_cpu_possible_mask(p). When this is done, the scheduler will make
sure the task will only be scheduled on CPUs that support it.

(The actual arm64 specific changes are not part of this tree.)

For other architectures there will be no change in functionality.

- Add cgroup SCHED_IDLE support

- Increase node-distance flexibility & delay determining it until a CPU
is brought online. (This enables platforms where node distance isn't
final until the CPU is only.)

- Deadline scheduler enhancements & fixes

- Misc fixes & cleanups.

* tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
eventfd: Make signal recursion protection a task bit
sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
sched: Introduce dl_task_check_affinity() to check proposed affinity
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Split the guts of sched_setaffinity() into a helper function
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
sched: Cgroup SCHED_IDLE support
sched/topology: Skip updating masks for non-online nodes
sched: Replace deprecated CPU-hotplug functions.
sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
sched: Fix UCLAMP_FLAG_IDLE setting
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
sched/fair: Avoid a second scan of target in select_idle_cpu
sched/fair: Use prev instead of new target as recent_used_cpu
sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
...


6f01c935 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
"This starts with a couple of fixes for potential deadlocks in the
fowner/fasync handling.

The next patch removes the old mandatory locking code from the kernel
altogether.

The last patch cleans up rw_verify_area a bit more after the mandatory
locking removal"

* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: clean up after mandatory file locking support removal
fs: remove mandatory file locking support
fcntl: fix potential deadlock for &fasync_struct.fa_lock
fcntl: fix potential deadlocks for &fown_struct.lock


aa99f3c2 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.

This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex


8cfb9015 27-Aug-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Always provide aligned buffers to the RPC read layers

Instead of messing around with XDR padding in the RDMA layer, we should
just give the RPC layer an aligned buffer. Try to avoid creating extra
RPC calls by aligning to the smaller value of ALIGN(len, rsize) and
PAGE_SIZE.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

a1ca8e71 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF and isofs updates from Jan Kara:
"Several smaller fixes and cleanups in UDF and isofs"

* tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf_get_extendedattr() had no boundary checks.
isofs: joliet: Fix iocharset=utf8 mount option
udf: Fix iocharset=utf8 mount option
udf: Get rid of 0-length arrays in struct fileIdentDesc
udf: Get rid of 0-length arrays
udf: Remove unused declaration
udf: Check LVID earlier


63b0c403 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull FIEMAP cleanups from Jan Kara:
"FIEMAP cleanups from Christoph transitioning all remaining filesystems
supporting FIEMAP (ext2, hpfs) to iomap API and removing the old
helper"

* tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: remove generic_block_fiemap
hpfs: use iomap_fiemap to implement ->fiemap
ext2: use iomap_fiemap to implement ->fiemap
ext2: make ext2_iomap_ops available unconditionally


f7db8dd6 29-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: enable realtime discard iff device supports discard

Let's only enable realtime discard if and only if device supports
discard functionality.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

dddd3d65 19-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: guarantee to write dirty data when enabling checkpoint back

We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

c8dc3047 25-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix to unmap pages from userspace process in punch_hole()

We need to unmap pages from userspace process before removing pagecache
in punch_hole() like we did in f2fs_setattr().

Similar change:
commit 5e44f8c374dc ("ext4: hole-punch use truncate_pagecache_range")

Fixes: fbfa2cc58d53 ("f2fs: add file operations")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

adf9ea89 25-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()

In below path, it will return ENOENT if filesystem is shutdown:

- f2fs_map_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_page
- __get_node_page
- read_node_page
- is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
return -ENOENT
- force return value from ENOENT to 0

It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.

However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.

There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.

xfstest generic/388 reports below warning:

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
iomap_apply+0x1c4/0x7b0
? dax_iomap_rw+0x1c0/0x1c0
dax_iomap_rw+0xad/0x1c0
? dax_iomap_rw+0x1c0/0x1c0
f2fs_file_write_iter+0x5ab/0x970 [f2fs]
do_iter_readv_writev+0x273/0x2e0
do_iter_write+0xab/0x1f0
vfs_iter_write+0x21/0x40
iter_file_splice_write+0x287/0x540
do_splice+0x37c/0xa60
__x64_sys_splice+0x15f/0x3a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
dax_iomap_fault+0x44/0x70
f2fs_dax_huge_fault+0x155/0x400 [f2fs]
f2fs_dax_fault+0x18/0x30 [f2fs]
__do_fault+0x4e/0x120
do_fault+0x3cf/0x7a0
__handle_mm_fault+0xa8c/0xf20
? find_held_lock+0x39/0xd0
handle_mm_fault+0x1b6/0x480
do_user_addr_fault+0x320/0xcd0
? rcu_read_lock_sched_held+0x67/0xc0
exc_page_fault+0x77/0x3f0
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30

Fixes: 83a3bfdb5a8a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

ad126ebd 23-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix to account missing .skipped_gc_rwsem

There is a missing place we forgot to account .skipped_gc_rwsem, fix it.

Fixes: 6f8d4455060d ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

d75da8c8 23-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: adjust unlock order for cleanup

This patch adjusts unlock order of .i_mmap_sem and .i_gc_rwsem for
cleanup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

4d674904 19-Aug-2021 Fengnan Chang <changfengnan@vivo.com>

f2fs: Don't create discard thread when device doesn't support realtime discard

Don't create discard thread when device doesn't support realtime discard
or user specifies nodiscard mount option.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

35134319 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
"fsnotify speedups when notification actually isn't used and support
for identifying processes which caused fanotify events through pidfd
instead of normal pid"

* tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: optimize the case of no marks of any type
fsnotify: count all objects with attached connectors
fsnotify: count s_fsnotify_inode_refs for attached connectors
fsnotify: replace igrab() with ihold() on attach connector
fanotify: add pidfd support to the fanotify API
fanotify: introduce a generic info record copying helper
fanotify: minor cosmetic adjustments to fid labels
kernel/pid.c: implement additional checks upon pidfd_create() parameters
kernel/pid.c: remove static qualifier from pidfd_create()


e8b8e97f 03-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Restyle comments to better align with kernel-doc

Capitalize comments and end with period for better reading.

Also function comments are now little more kernel-doc style. This way we
can easily convert them to kernel-doc style if we want. Note that these
are not yet complete with this style. Example function comments start
with /* and in kernel-doc style they start /**.

Use imperative mood in function descriptions.

Change words like ntfs -> NTFS, linux -> Linux.

Use "we" not "I" when commenting code.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

87df7fb9 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix wakeup race when adding new work

When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:

CPU0 CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
lock
atomic_dec(nr_running)
no wakeup needed

Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9a4aa9f 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: wqe and worker locks no longer need to be IRQ safe

io_uring no longer queues async work off completion handlers that run in
hard or soft interrupt context, and that use case was the only reason that
io-wq had to use IRQ safe locks for wqe and worker locks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecc53c48 29-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: check max_worker limits if a worker transitions bound state

For the two places where new workers are created, we diligently check if
we are allowed to create a new worker. If we're currently at the limit
of how many workers of a given type we can have, then we don't create
any new ones.

If you have a mixed workload with various types of bound and unbounded
work, then it can happen that a worker finishes one type of work and
is then transitioned to the other type. For this case, we don't check
if we are actually allowed to do so. This can cause io-wq to temporarily
exceed the allowed number of workers for a given type.

When retrieving work, check that the types match. If they don't, check
if we are allowed to transition to the other type. If not, then don't
handle the new work.

Cc: stable@vger.kernel.org
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1042b6c 28-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: allow updating linked timeouts

We allow updating normal timeouts, add support for adjusting timings of
linked timeouts as well.

Reported-by: Victor Stewart <v@nametag.social>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef9dd637 28-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: keep ltimeouts in a list

A preparation patch. Keep all queued linked timeout in a list, so they
may be found and updated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50c1df2b 27-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts

Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.

Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.

Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.

Link: https://github.com/axboe/liburing/issues/369
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e480058 27-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: provide a way to limit max number of workers

io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file
or a block device. This type of work is limited based on the size of
the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount
of workers here is just limited by RLIMIT_NPROC.

For various uses cases, it's handy to have the kernel limit the maximum
amount of pending workers for both categories. Provide a way to do with
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.

The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.

Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: https://github.com/axboe/liburing/issues/420
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b542e383 29-Jul-2021 Thomas Gleixner <tglx@linutronix.de>

eventfd: Make signal recursion protection a task bit

The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.

Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.

Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.

No functional change for !RT kernels.

Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx

2a7a451a 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

NFSv4.1 add network transport when session trunking is detected

After trunking is discovered in nfs4_discover_server_trunking(),
add the transport to the old client structure if the allowed limit
of transports has not been reached.

An example: there exists a multi-homed server and client mounts
one server address and some volume and then doest another mount to
a different address of the same server and perhaps a different
volume. Previously, the client checks that this is a session
trunkable servers (same server), and removes the newly created
client structure along with its transport. Now, the client
adds the connection from the 2nd mount into the xprt switch of
the existing client (it leads to having 2 available connections).

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

dc48e0ab 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

SUNRPC enforce creation of no more than max_connect xprts

If we are adding new transports via rpc_clnt_test_and_add_xprt()
then check if we've reached the limit. Currently only pnfs path
adds transports via that function but this is done in
preparation when the client would add new transports when
session trunking is detected. A warning is logged if the
limit is reached.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7e134205 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

NFSv4 introduce max_connect mount options

This option will control up to how many xprts can the client
establish to the server with a distinct address (that means
nconnect connections are not counted towards this new limit).
This patch is setting up nfs structures to keeep track of the
max_connect limit (does not enforce it).

The default value is kept at 1 so that no current mounts that
don't want any additional connections would be effected. The
maximum value is set at 16.

Mounts to DS are not limited to default value of 1 but instead
set to the maximum default value of 16 (NFS_MAX_TRANSPORTS).

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

79d534f8 17-Aug-2021 Ye Bin <yebin10@huawei.com>

NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox

As eb96d5c97b08 ("SUNRPC handle EKEYEXPIRED in call_refreshresult")
commit handle EKEYEXPIRED in call_refreshresult, so there is only handle
when "task->tk_status" is equal "-EJUKEBOX" in nfs3_async_handle_jukebox.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7d5d8d71 26-Aug-2021 Namjae Jeon <namjae.jeon@samsung.com>

ksmbd: fix __write_overflow warning in ndr_read_string

Dan reported __write_overflow warning in ndr_read_string.

CC [M] fs/ksmbd/ndr.o
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:11,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/cpumask.h:5,
from ./arch/x86/include/asm/msr.h:11,
from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:60,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from ./include/linux/wait.h:9,
from ./include/linux/wait_bit.h:8,
from ./include/linux/fs.h:6,
from fs/ksmbd/ndr.c:7:
In function memcpy,
inlined from ndr_read_string at fs/ksmbd/ndr.c:86:2,
inlined from ndr_decode_dos_attr at fs/ksmbd/ndr.c:167:2:
./include/linux/fortify-string.h:219:4: error: call to __write_overflow
declared with attribute error: detected write beyond size of object
__write_overflow();
^~~~~~~~~~~~~~~~~~

This seems to be a false alarm because hex_attr size is always smaller
than n->length. This patch fix this warning by allocation hex_attr with
n->length.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

90499ad0 25-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add build check for buf_index overflows

req->buf_index is u16 and so we rely on registered buffers indexes
fitting into it. Add a build check, so when the upper limit for the
number of buffers is lifted we get a compliation fail but not lurking
problems.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b18a1a45 25-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clarify io_req_task_cancel() locking

It's too easy to forget and misjudge about synchronisation in
io_req_task_cancel(), add a comment clarifying it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71099083835f983a1fd73d5a3da6391924da8300.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8155e95 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Fix error handling in indx_insert_into_root()

There are three bugs in this code:
1) If indx_get_root() fails, then return -EINVAL instead of success.
2) On the "/* make root external */" -EOPNOTSUPP; error path it should
free "re" but it has a memory leak.
3) If indx_new() fails then it will lead to an error pointer dereference
when we call put_indx_node().

I've re-written the error handling to be more clear.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

8c83a485 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Potential NULL dereference in hdr_find_split()

The "e" pointer is dereferenced before it has been checked for NULL.
Move the dereference after the NULL check to prevent an Oops.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

04810f00 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Fix error code in indx_add_allocate()

Return -EINVAL if ni_find_attr() fails. Don't return success.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

2926e429 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: fix an error code in ntfs_get_acl_ex()

The ntfs_get_ea() function returns negative error codes or on success
it returns the length. In the original code a zero length return was
treated as -ENODATA and results in a NULL return. But it should be
treated as an invalid length and result in an PTR_ERR(-EINVAL) return.

Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

a1b04d38 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: add checks for allocation failure

Add a check for when the kzalloc() in init_rsttbl() fails. Some of
the callers checked for NULL and some did not. I went down the call
tree and added NULL checks where ever they were missing.

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

345482bc 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc

Use kcalloc/kmalloc_array over kzalloc/kmalloc when we allocate array.
Checkpatch found these after we did not use our own defined allocation
wrappers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

195c52bd 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Do not use driver own alloc wrappers

Problem with these wrapper is that we cannot take off example GFP_NOFS
flag. It is not recomended use those in all places. Also if we change
one driver specific wrapper to kernel wrapper then it would look really
weird. People should be most familiar with kernel wrappers so let's just
use those ones.

Driver specific alloc wrapper also confuse some static analyzing tools,
good example is example kernels checkpatch tool. After we converter
these to kernel specific then warnings is showed.

Following Coccinelle script was used to automate changing.

virtual patch

@alloc depends on patch@
expression x;
expression y;
@@
(
- ntfs_malloc(x)
+ kmalloc(x, GFP_NOFS)
|
- ntfs_zalloc(x)
+ kzalloc(x, GFP_NOFS)
|
- ntfs_vmalloc(x)
+ kvmalloc(x, GFP_NOFS)
|
- ntfs_free(x)
+ kfree(x)
|
- ntfs_vfree(x)
+ kvfree(x)
|
- ntfs_memdup(x, y)
+ kmemdup(x, y, GFP_NOFS)
)

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fa3cacf5 26-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use kernel ALIGN macros over driver specific

The static checkers (Smatch) were complaining because QuadAlign() was
buggy. If you try to align something higher than UINT_MAX it got
truncated to a u32.

Smatch warning was:
fs/ntfs3/attrib.c:383 attr_set_size_res()
warn: was expecting a 64 bit value instead of '~7'

So that this will not happen again we will change all these macros to
kernel made ones. This can also help some other static analyzing tools
to give us better warnings.

Patch was generated with Coccinelle script and after that some style
issue was hand fixed.

Coccinelle script:

virtual patch

@alloc depends on patch@
expression x;
@@
(
- #define QuadAlign(n) (((n) + 7u) & (~7u))
|
- QuadAlign(x)
+ ALIGN(x, 8)
|
- #define IsQuadAligned(n) (!((size_t)(n)&7u))
|
- IsQuadAligned(x)
+ IS_ALIGNED(x, 8)
|
- #define Quad2Align(n) (((n) + 15u) & (~15u))
|
- Quad2Align(x)
+ ALIGN(x, 16)
|
- #define IsQuad2Aligned(n) (!((size_t)(n)&15u))
|
- IsQuad2Aligned(x)
+ IS_ALIGNED(x, 16)
|
- #define Quad4Align(n) (((n) + 31u) & (~31u))
|
- Quad4Align(x)
+ ALIGN(x, 32)
|
- #define IsSizeTAligned(n) (!((size_t)(n) & (sizeof(size_t) - 1)))
|
- IsSizeTAligned(x)
+ IS_ALIGNED(x, sizeof(size_t))
|
- #define DwordAlign(n) (((n) + 3u) & (~3u))
|
- DwordAlign(x)
+ ALIGN(x, 4)
|
- #define IsDwordAligned(n) (!((size_t)(n)&3u))
|
- IsDwordAligned(x)
+ IS_ALIGNED(x, 4)
|
- #define WordAlign(n) (((n) + 1u) & (~1u))
|
- WordAlign(x)
+ ALIGN(x, 2)
|
- #define IsWordAligned(n) (!((size_t)(n)&1u))
|
- IsWordAligned(x)
+ IS_ALIGNED(x, 2)
|
)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

24516d48 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Restyle comment block in ni_parse_reparse()

First of this fix one none utf8 char in this comment block. Maybe
this happened because error in filesystem ;)

Also this block was hard to read because long lines so make it max 80
long. And while we doing this stuff make little better grammer.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

1263eddf 19-Aug-2021 Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

fs/ntfs3: Remove unused including <linux/version.h>

Eliminate the follow versioncheck warning:

./fs/ntfs3/inode.c: 16 linux/version.h not needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

abfeb2ee 18-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

fs/ntfs3: Fix fall-through warnings for Clang

Fix the following fallthrough warnings:

fs/ntfs3/inode.c:1792:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/ntfs3/index.c:178:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]

This helps with the ongoing efforts to globally enable
-Wimplicit-fallthrough for Clang.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

be87e821 17-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix one none utf8 char in source file

In one source file there is for some reason non utf8 char. But hey this
is fs development so this kind of thing might happen.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

8c01308b 16-Aug-2021 Nathan Chancellor <nathan@kernel.org>

fs/ntfs3: Remove unused variable cnt in ntfs_security_init()

Clang warns:

fs/ntfs3/fsntfs.c:1874:9: warning: variable 'cnt' set but not used
[-Wunused-but-set-variable]
size_t cnt, off;
^
1 warning generated.

It is indeed unused so remove it.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

71eeb6ac 16-Aug-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Fix integer overflow in multiplication

The multiplication of the u32 data_size with a int is being performed
using 32 bit arithmetic however the results is being assigned to the
variable nbits that is a size_t (64 bit) value. Fix a potential
integer overflow by casting the u32 value to a size_t before the
multiply to use a size_t sized bit multiply operation.

Addresses-Coverity: ("Unintentional integer overflow")
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

87790b65 16-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add ifndef + define to all header files

Add guards so that compiler will only include header files once.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

528c9b3d 16-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use linux/log2 is_power_of_2 function

We do not need our own implementation for this function in this
driver. It is much better to use generic one.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

f8d87ed9 16-Aug-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Fix various spelling mistakes

There is a spelling mistake in a ntfs_err error message. Also
fix various spelling mistakes in comments.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9a10867a 27-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add task-refs-get helper

As we have a more complicated task referencing, which apart from normal
task references includes taking tctx->inflight and caching all that, it
would be a good idea to have all that isolated in helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9114d037f1c195897aa13f38a496078eca2afdb.1630023531.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8295b98 27-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix failed linkchain code logic

Given a linkchain like this:
req0(link_flag)-->req1(link_flag)-->...-->reqn(no link_flag)

There is a problem:
- if some intermediate linked req like req1 's submittion fails, reqs
after it won't be cancelled.

- sqpoll disabled: maybe it's ok since users can get the error info
of req1 and stop submitting the following sqes.

- sqpoll enabled: definitely a problem, the following sqes will be
submitted in the next round.

The solution is to refactor the code logic to:
- if a linked req's submittion fails, just mark it and the head(if it
exists) as REQ_F_FAIL. Leverage req->result to indicate whether it
is failed or cancelled.
- submit or fail the whole chain when we come to the end of it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

14afdd6e 27-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: remove redundant req_set_fail()

req_set_fail() in io_submit_sqe() is redundant, remove it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

20ec197b 29-Mar-2021 David Howells <dhowells@redhat.com>

fscache: Use refcount_t for the cookie refcount instead of atomic_t

Use refcount_t for the fscache_cookie refcount instead of atomic_t and
rename the 'usage' member to 'ref' in such cases. The tracepoints that
reference it change from showing "u=%d" to "r=%d".

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431204358.2908479.8006938388213098079.stgit@warthog.procyon.org.uk/

33cba859 18-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Fix fscache_cookie_put() to not deref after dec

fscache_cookie_put() accesses the cookie it has just put inside the
tracepoint that monitors the change - but this is something it's not
allowed to do if we didn't reduce the count to zero.

Fix this by dropping most of those values from the tracepoint and grabbing
the cookie debug ID before doing the dec.

Also take the opportunity to switch over the usage and where arguments on
the tracepoint to put the reason last.

Fixes: a18feb55769b ("fscache: Add tracepoints")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431203107.2908479.3259582550347000088.stgit@warthog.procyon.org.uk/

35b72573 17-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Fix cookie key hashing

The current hash algorithm used for hashing cookie keys is really bad,
producing almost no dispersion (after a test kernel build, ~30000 files
were split over just 18 out of the 32768 hash buckets).

Borrow the full_name_hash() hash function into fscache to do the hashing
for cookie keys and, in the future, volume keys.

I don't want to use full_name_hash() as-is because I want the hash value to
be consistent across arches and over time as the hash value produced may
get used on disk.

I can also optimise parts of it away as the key will always be a padded
array of aligned 32-bit words.

Fixes: ec0328e46d6e ("fscache: Maintain a catalogue of allocated cookies")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431201844.2908479.8293647220901514696.stgit@warthog.procyon.org.uk/

8beabdde 19-Oct-2020 David Howells <dhowells@redhat.com>

cachefiles: Change %p in format strings to something else

Change plain %p in format strings in cachefiles code to something more
useful, since %p is now hashed before printing and thus no longer matches
the contents of an oops register dump.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/160588476042.3465195.6837847445880367183.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431200692.2908479.9253374494073633778.stgit@warthog.procyon.org.uk/

c97a72de 19-Oct-2020 David Howells <dhowells@redhat.com>

fscache: Change %p in format strings to something else

Change plain %p in format strings in fscache code to something more useful,
since %p is now hashed before printing and thus no longer matches the
contents of an oops register dump.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/160588474843.3465195.5446072310069374803.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431199509.2908479.2950631488219944294.stgit@warthog.procyon.org.uk/

58f386a7 12-May-2021 David Howells <dhowells@redhat.com>

fscache: Remove the object list procfile

Remove the object list procfile from fscache as objects will become
entirely internal to the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431198332.2908479.5847286163455099669.stgit@warthog.procyon.org.uk/

6ae9bd8b 12-May-2021 David Howells <dhowells@redhat.com>

fscache, cachefiles: Remove the histogram stuff

Remove the histogram stuff as it's mostly going to be outdated.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431195953.2908479.16770977195634296638.stgit@warthog.procyon.org.uk/

884a7688 10-Feb-2020 David Howells <dhowells@redhat.com>

fscache: Procfile to display cookies

Add /proc/fs/fscache/cookies to display active cookies.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/158861211871.340223.7223853943667440807.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465771021.1376105.6933857529128238020.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588460994.3465195.16963417803501149328.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431194785.2908479.786917990782538164.stgit@warthog.procyon.org.uk/

2908f5e1 10-Feb-2020 David Howells <dhowells@redhat.com>

fscache: Add a cookie debug ID and use that in traces

Add a cookie debug ID and use that in traces and in procfiles rather than
displaying the (hashed) pointer to the cookie. This is easier to correlate
and we don't lose anything when interpreting oops output since that shows
unhashed addresses and registers that aren't comparable to the hashed
values.

Changes:

ver #2:
- Fix the fscache_op tracepoint to handle a NULL cookie pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/158861210988.340223.11688464116498247790.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465769844.1376105.14119502774019865432.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588459097.3465195.1273313637721852165.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431193544.2908479.17556704572948300790.stgit@warthog.procyon.org.uk/

bdd3c50d 26-Aug-2021 Christoph Hellwig <hch@lst.de>

dax: remove bdev_dax_supported

All callers already have a dax_device obtained from fs_dax_get_by_bdev
at hand, so just pass that to dax_supported() insted of doing another
lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a384f088 26-Aug-2021 Christoph Hellwig <hch@lst.de>

xfs: factor out a xfs_buftarg_is_dax helper

Refactor the DAX setup code in preparation of removing
bdev_dax_supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6c97ec17 26-Aug-2021 Christoph Hellwig <hch@lst.de>

fsdax: improve the FS_DAX Kconfig description and help text

Rename the main option text to clarify it is for file system access,
and add a bit of text that explains how to actually switch a nvdimm
to a fsdax capable state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210826135510.6293-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

1568cb0e 06-Jul-2021 Johannes Berg <johannes.berg@intel.com>

hostfs: support splice_write

There's really no good reason not to, and e.g. trace-cmd
currently requires it for the temporary per-CPU files.
Hook up splice_write just like everyone else does.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

0bcc7ca4 24-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfsd: fix crash on LOCKT on reexported NFSv3

Unlike other filesystems, NFSv3 tries to use fl_file in the GETLK case.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

bb0a55bb 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfs: don't allow reexport reclaims

In the reexport case, nfsd is currently passing along locks with the
reclaim bit set. The client sends a new lock request, which is granted
if there's currently no conflict--even if it's possible a conflicting
lock could have been briefly held in the interim.

We don't currently have any way to safely grant reclaim, so for now
let's just deny them all.

I'm doing this by passing the reclaim bit to nfs and letting it fail the
call, with the idea that eventually the client might be able to do
something more forgiving here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b840be2f 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

lockd: don't attempt blocking locks on nfs reexports

As in the v4 case, it doesn't work well to block waiting for a lock on
an nfs filesystem.

As in the v4 case, that means we're depending on the client to poll.
It's probably incorrect to depend on that, but I *think* clients do poll
in practice. In any case, it's an improvement over hanging the lockd
thread indefinitely as we currently are.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f657f8ee 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfs: don't atempt blocking locks on nfs reexports

NFS implements blocking locks by blocking inside its lock method. In
the reexport case, this blocks the nfs server thread, which could lead
to deadlocks since an nfs server thread might be required to unlock the
conflicting lock. It also causes a crash, since the nfs server thread
assumes it can free the lock when its lm_notify lock callback is called.

Ideal would be to make the nfs lock method return without blocking in
this case, but for now it works just not to attempt blocking locks. The
difference is just that the original client will have to poll (as it
does in the v4.0 case) instead of getting a callback when the lock's
available.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

97d8cc20 26-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"Two memory management fixes for the filesystem"

* tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client:
ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
ceph: correctly handle releasing an embedded cap flush


9b49ceb8 26-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"One more fix that I think qualifies for a late merge. It's a revert of
a one-liner fix that meanwhile got backported to stable kernels and we
got reports from users.

The broken fix prevents creating compressed inline extents, which
could be noticeable on space consumption.

Technically it's a regression as the patch was merged in 5.14-rc1 but
got propagated to several stable kernels and has higher exposure than
a 'typical' development cycle bug"

* tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: compression: don't try to compress if we don't have enough pages"


03b8df8d 21-Aug-2021 Darrick J. Wong <djwong@kernel.org>

iomap: standardize tracepoint formatting and storage

Print all the offset, pos, and length quantities in hexadecimal. While
we're at it, update the types of the tracepoint structure fields to
match the types of the values being recorded in them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

3998f0b8 25-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED

RHBZ: 1994393

If we hit a STATUS_USER_SESSION_DELETED for the Create part in the
Create/QueryDirectory compound that starts a directory scan
we will leak EDEADLK back to userspace and surprise glibc and the application.

Pick this up initiate_cifs_search() and retry a small number of tries before we
return an error to userspace.

Cc: stable@vger.kernel.org
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

38f4910b 23-Aug-2021 Steve French <stfrench@microsoft.com>

cifs: cifs_md4 convert to SPDX identifier

Add SPDX license identifier and replace license boilerplate
for cifs_md4.c

Signed-off-by: Steve French <stfrench@microsoft.com>

42c21973 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: create a MD4 module and switch cifs.ko to use it

MD4 support will likely be removed from the crypto directory, but
is needed for compression of NTLMSSP in SMB3 mounts.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

71c02863 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: fork arc4 and create a separate module for it for cifs and other users

We can not drop ARC4 and basically destroy CIFS connectivity for
almost all CIFS users so create a new forked ARC4 module that CIFS and other
subsystems that have a hard dependency on ARC4 can use.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

76a3c92e 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: remove support for NTLM and weaker authentication algorithms

for SMB1.
This removes the dependency to DES.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

18d04062 10-Aug-2021 Shyam Prasad N <sprasad@microsoft.com>

cifs: enable fscache usage even for files opened as rw

So far, the fscache implementation we had supports only
a small set of use cases. Particularly for files opened
with O_RDONLY.

This commit enables it even for rw based file opens. It
also enables the reuse of cached data in case of mount
option (cache=singleclient) where it is guaranteed that
this is the only client (and server) which operates on
the files. There's also a single line change in fscache.c
to get around a bug seen in fscache.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

7321be26 23-Aug-2021 Steve French <stfrench@microsoft.com>

smb3: fix posix extensions mount option

We were incorrectly initializing the posix extensions in the
conversion to the new mount API.

CC: <stable@vger.kernel.org> # 5.11+
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

d72c7419 17-Aug-2021 Ding Hui <dinghui@sangfor.com.cn>

cifs: fix wrong release in sess_alloc_buffer() failed path

smb_buf is allocated by small_smb_init_no_tc(), and buf type is
CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to
release it in failed path.

Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>

f980d055 16-Aug-2021 Len Baker <len.baker@gmx.com>

CIFS: Fix a potencially linear read overflow

strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated.

Also, the strnlen() call does not avoid the read overflow in the strlcpy
function when a not NUL-terminated string is passed.

So, replace this block by a call to kstrndup() that avoids this type of
overflow and does the same.

Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions")
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

0c6e1d7f 25-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: don't free request to slab

It's not necessary to free the request back to slab when we fail to
get sqe, just move it to state->free_list.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe67f4dd 24-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

pipe: do FASYNC notifications for every pipe IO, not just state changes

It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.

Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.

But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).

The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.

This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.

So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.

Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.

It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").

And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.

Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.

Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a9e6ffbc 05-Aug-2021 Tuo Li <islituo@gmail.com>

ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()

kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage
only kfree(m->m_info) if it's non-NULL ]

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b2f9fa1f 18-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: correctly handle releasing an embedded cap flush

The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

18598195 03-Jun-2021 David Howells <dhowells@redhat.com>

cachefiles: Use file_inode() rather than accessing ->f_inode

Use the file_inode() helper rather than accessing ->f_inode directly.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/

a7e20e31 12-May-2021 David Howells <dhowells@redhat.com>

netfs: Move cookie debug ID to struct netfs_cache_resources

Move the cookie debug ID from struct netfs_read_request to struct
netfs_cache_resources and drop the 'cookie_' prefix. This makes it
available for things that want to use netfs_cache_resources without having
a netfs_read_request.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/

4c5e4139 04-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Select netfs stats if fscache stats are enabled

Unconditionally select the stats produced by the netfs lib if fscache stats
are enabled as the former are displayed in the latter's procfile.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/

1266b4a7 25-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: fix double free of 'copied'

Dan reported a new smatch warning [1]
"fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'"

Due to new chunk-based format handling logic, the error path can be
called after kfree(copied).

Set "copied = NULL" after kfree(copied) to fix this.

[1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com

Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com
Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

4e965576 24-Aug-2021 Qu Wenruo <wqu@suse.com>

Revert "btrfs: compression: don't try to compress if we don't have enough pages"

This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.

[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627319f ("btrfs: compression: don't try to compress if we don't
have enough pages").

[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.

- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.

- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.

For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.

[FIX]
Fix it by reverting the offending commit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

aaa4db12 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: accept directly into fixed file table

As done with open opcodes, allow accept to skip installing fd into
processes' file tables and put it directly into io_uring's fixed file
table. Same restrictions and design as for open.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a7083ad5 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: hand code io_accept() fd installing

Make io_accept() to handle file descriptor allocations and installation.
A preparation patch for bypassing file tables.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9445598 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: openat directly into fixed fd table

Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table.
This patch adds such a capability for IORING_OP_OPENAT and
IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour using normal file tables. If non-zero value is
specified, then it will behave as described and place the file into a
fixed file slot sqe->file_index - 1. A file table should be already
created, the slot should be valid and empty, otherwise the operation
will fail.

Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and
EINVAL on inappropriate fixed tables, and return EBADF on collision with
already registered file.

Note: IOSQE_FIXED_FILE can't be used to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c42dd069 24-Aug-2021 Sishuai Gong <sishuai@purdue.edu>

configfs: fix a race in configfs_lookup()

When configfs_lookup() is executing list_for_each_entry(),
it is possible that configfs_dir_lseek() is calling list_del().
Some unfortunate interleavings of them can cause a kernel NULL
pointer dereference error

Thread 1 Thread 2
//configfs_dir_lseek() //configfs_lookup()
list_del(&cursor->s_sibling);
list_for_each_entry(sd, ...)

Fix this by grabbing configfs_dirent_lock in configfs_lookup()
while iterating ->s_children.

Signed-off-by: Sishuai Gong <sishuai@purdue.edu>
Signed-off-by: Christoph Hellwig <hch@lst.de>

d07f132a 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: fold configfs_attach_attr into configfs_lookup

This makes it more clear what gets added to the dcache and prepares
for an additional locking fix.

Signed-off-by: Christoph Hellwig <hch@lst.de>

899587c8 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: simplify the configfs_dirent_is_ready

Return the error directly instead of using a goto.

Signed-off-by: Christoph Hellwig <hch@lst.de>

417b962d 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: return -ENAMETOOLONG earlier in configfs_lookup

Just like most other file systems: get the simple checks out of the
way first.

Signed-off-by: Christoph Hellwig <hch@lst.de>

f38a032b 24-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: fix I_DONTCACHE

Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads
make it clear that it doesn't work as it should.

Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

158ee7b6 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: mark blkdev_fsync static

blkdev_fsync is only used inside of block_dev.c since the
removal of the raw drіver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824151823.1575100-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2949e842 24-Aug-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

fs: clean up after mandatory file locking support removal

Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes
some operations in functions rw_verify_area().

As these functions are now simplified, do some syntactic clean-up as
follow-up to the removal as well, which was pointed out by compiler
warnings and static analysis.

No functional change.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

72a048c1 20-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: only set IOMAP_F_SHARED when providing a srcmap to a write

While prototyping a free space defragmentation tool, I observed an
unexpected IO error while running a sequence of commands that can be
recreated by the following sequence of commands:

# xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
# cp --reflink=always file1 file2
# punch-alternating -o 1 file2
# xfs_io -c "funshare 0 10m" file2
fallocate: Input/output error

I then scraped this (abbreviated) stack trace from dmesg:

WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:iomap_write_begin+0x376/0x450
RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
Call Trace:
iomap_unshare_actor+0x95/0x140
iomap_apply+0xfa/0x300
iomap_file_unshare+0x44/0x60
xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
vfs_fallocate+0x133/0x330
__x64_sys_fallocate+0x3e/0x70
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4b8f79140a

Looking at the iomap tracepoints, I saw this:

iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450

The first time funshare calls ->iomap_begin, xfs sees that the first
block is shared and creates a 128k delalloc reservation in the COW fork.
The delalloc reservation is returned as dstmap, and the shared block is
returned as srcmap. So far so good.

funshare calls ->iomap_begin to try the second block. This time there's
no srcmap (punch-alternating punched it out!) but we still have the
delalloc reservation in the COW fork. Therefore, we again return the
reservation as dstmap and the hole as srcmap. iomap_unshare_iter
incorrectly tries to unshare the hole, which __iomap_write_begin rejects
because shared regions must be fully written and therefore cannot
require zeroing.

Therefore, change the buffered write iomap_begin function not to set
IOMAP_F_SHARED when there isn't a source mapping to read from for the
unsharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

7f024fcd 23-Aug-2021 J. Bruce Fields <bfields@redhat.com>

Keep read and write fds with each nlm_file

We shouldn't really be using a read-only file descriptor to take a write
lock.

Most filesystems will put up with it. But NFS, for example, won't.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

cf30da90 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_LINKAT

IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7a8721f8 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_SYMLINKAT

IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

01cfa28a 12-Aug-2021 Christoph Hellwig <hch@lst.de>

block: use the percpu bio cache in __blkdev_direct_IO

Use bio_alloc_kiocb to dip into the percpu cache of bios when the
caller asks for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

394918eb 08-Mar-2021 Jens Axboe <axboe@kernel.dk>

io_uring: enable use of bio alloc cache

Mark polled IO as being safe for dipping into the bio allocation
cache, in case the targeted bio_set has it enabled.

This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to
~3.5M IOPS.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dadebc35 23-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix io_try_cancel_userdata race for iowq

WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.

note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel left pending
requests in such a way.

Cc: stable@vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e34a02dc 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_MKDIRAT

IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

45f30dab 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: update do_*() helpers to return ints

Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:

* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

020250f3 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_linkat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with do_renameat2, do_unlinkat, do_mknodat, etc.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8228e2c3 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: add getname_uflags()

There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future. Let's just add a simple helper getname_uflags() that handles
this directly and use it.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da2d0ced 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_symlinkat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_mkdnodat(), do_unlinkat(),
do_renameat(), do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7797251b 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_mknodat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_unlinkat(), do_renameat(),
do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

584d3226 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_mkdirat() take struct filename

Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename").

This behaves like do_unlinkat() and do_renameat2().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0ee50b47 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: change filename_parentat() calling conventions

Since commit 5c31b6cedb675 ("namei: saner calling conventions for
filename_parentat()") filename_parentat() had the following behavior WRT
the passed in struct filename *:

* On error the name is consumed (putname() is called on it);
* On success the name is returned back as the return value;

Now there is a need for filename_create() and filename_lookup() variants
that do not consume the passed filename, and following the same "consume
the name only on error" semantics is proven to be hard to reason about
and result in confusing code.

Hence this preparation change splits filename_parentat() into two: one
that always consumes the name and another that never consumes the name.
This will allow to implement two filename_create() variants in the same
way, and is a consistent and hopefully easier to reason about approach.

Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91ef658f 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: ignore ERR/NULL names in putname()

Supporting ERR/NULL names in putname() makes callers code cleaner, and
is what some other path walking functions already support for the same
reason.

This also removes a few existing IS_ERR checks before putname().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

126180b9 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: IRQ rw completion batching

Employ inline completion logic for read/write completions done via
io_req_task_complete(). If ->uring_lock is contended, just do normal
request completion, but if not, make tctx_task_work() to grab the lock
and do batched inline completions in io_req_task_complete().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f237c30a 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: batch task work locking

Many task_work handlers either grab ->uring_lock, or may benefit from
having it. Move locking logic out of individual handlers to a lazy
approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5636c00d 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: flush completions for fallbacks

io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to flush
completions preparing for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26578cda 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add ->splice_fd_in checks

->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most of request types excluding
reads/writes/sends/recvs, we don't want overhead for them and can leave
them be as is until the field is actually used.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2c5d763c 21-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: add clarifying comment for io_cqring_ev_posted()

We've previously had an issue where overflow flush unconditionally calls
io_cqring_ev_posted() even if it didn't flush any events to the ring,
causing wake and eventfd increment where no new events are available.
Some applications don't like that, see commit b18032bb0a88 for details.

This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.

Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0bea96f5 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: place fixed tables under memcg limits

Fixed tables may be large enough, place all of them together with
allocated tags under memcg limits.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a1b8a4e 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: limit fixed table size by RLIMIT_NOFILE

Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE,
that's the first and the simpliest restriction that we should impose.

Cc: stable@vger.kernel.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

99c8bc52 20-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix lack of protection for compl_nr

coml_nr in ctx_flush_and_put() is not protected by uring_lock, this
may cause problems when accessing in parallel:

say coml_nr > 0

ctx_flush_and put other context
if (compl_nr) get mutex
coml_nr > 0
do flush
coml_nr = 0
release mutex
get mutex
do flush (*)
release mutex

in (*) place, we call io_cqring_ev_posted() and users likely get
no events there. To avoid spurious events, re-check the value when
under the lock.

Fixes: 2c32395d8111 ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

187f08c1 18-Aug-2021 wangyangbo <wangyangbo@uniontech.com>

io_uring: Add register support for non-4k PAGE_SIZE

Now allocated rsrc table uses PAGE_SIZE as the size of 2nd-level, and
accessing this table relies on each level index from fixed TABLE_SHIFT
(12 - 3) in 4k page case. In order to correctly work in non-4k page,
define TABLE_SHIFT as non-fixed (PAGE_SHIFT - shift of data) for
2nd-level table entry number.

Signed-off-by: wangyangbo <wangyangbo@uniontech.com>
Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e98e49b2 18-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: extend task put optimisations

Now with IRQ completions done via IRQ, almost all requests freeing
are done from the context of submitter task, so it makes sense to
extend task_put optimisation from io_req_free_batch_finish() to cover
all the cases including task_work by moving it into io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

316319e8 19-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: add comments on why PF_EXITING checking is safe

We have two checks of task->flags & PF_EXITING left:

1) In io_req_task_submit(), which is called in task_work and hence always
in the context of the original task. That means that
req->task == current, and hence checking ->flags is totally fine.

2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
it interfering with cancelation. This is only run from task_work as
well, and hence for this case too req->task == current.

Add a comment to both spots detailing that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

79dca184 10-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io-wq: move nr_running and worker_refs out of wqe->lock protection

We don't need to protect nr_running and worker_refs by wqe->lock, so
narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ec3c3d0f 18-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix io_timeout_remove locking

io_timeout_cancel() posts CQEs so needs ->completion_lock to be held,
so grab it in io_timeout_remove().

Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

23a65db8 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: improve same wq polling

Move earlier the check for whether __io_queue_proc() tries to poll
already polled waitqueue, and do the same for the second poll entry, if
any. Shouldn't really matter, but at least it would have a more
predictable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

505657bc 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: reuse io_req_complete_post()

We have io_req_complete_post() to post a CQE and put the request. It
takes care of all synchronisation and is more concise and efficent, so
replace all hancoded occurrences of
"lock; post CQE; unlock; + put_req()" with io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ae421d93 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: better encapsulate buffer select for rw

Make io_put_rw_kbuf() to do the REQ_F_BUFFER_SELECTED check, so all the
callers don't need to hand code it. The number of places where we call
io_put_rw_kbuf() is growing, so saves some pain.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

906c6caa 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise io_prep_linked_timeout()

Linked timeout handling during issuing is heavy, it adds extra
instructions and forces to save the next linked timeout before
io_issue_sqe().

Follwing the same reasoning as in refcounting patches, a request can't
be freed by the time it returns from io_issue_sqe(), so now we don't
need to do io_prep_linked_timeout() in advance, and it can be delayed to
colder paths optimising the generic path.

Also, it should also save quite a lot for requests with linked timeouts
and completed inline on timeout spinlocking + hrtimer_start() +
hrtimer_try_to_cancel() and so on.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0756a869 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: cancel not-armed linked touts separately

Adjust io_disarm_next(), so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used in
the following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d13d1a4 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: simplify io_prep_linked_timeout

The link test in io_prep_linked_timeout() is pretty bulky, replace it
with a flag. It's better for normal path and linked requests, and also
will be used further for request failing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b97e736a 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill REQ_F_LTIMEOUT_ACTIVE

Instead of handling double consecutive linked timeouts through tricky
flag combinations, just check the submit_state.link during timeout_prep
and fail that case in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fd08e530 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise hot path of ltimeout prep

io_prep_linked_timeout() grew too heavy and compiler now refuse to
inline the function. Help it by splitting in two and annotating with
inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8cb01fac 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: deduplicate cancellation code

IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough of
overlap, so extract a helper for request cancellation and use in both.
Also, removes some amount of ugliness because of success_ret.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8576af9 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill not necessary resubmit switch

773af69121ecc ("io_uring: always reissue from task_work context") makes
all resubmission to be made from task_work, so we don't need that hack
with resubmit/not-resubmit switch anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fb682099 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise initial ltimeout refcounting

Linked timeouts are never refcounted when it comes to the first call to
__io_prep_linked_timeout(), so save an io_ref_get() and set the desired
value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

761bcac1 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't inflight-track linked timeouts

Tracking linked timeouts as infligh was needed to make sure that io-wq
is not destroyed by io_uring_cancel_generic() racing with
io_async_cancel_one() accessing it. Now, cancellations issued by linked
timeouts are done in the task context, so it's already synchronised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

48dcd38d 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise iowq refcounting

If a requests is forwarded into io-wq, there is a good chance it hasn't
been refcounted yet and we can save one req_ref_get() by setting the
refcount number to the right value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a141dd89 12-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: correct __must_hold annotation

io_req_free_batch() has a __must_hold annotation referencing a
request being passed in, but we're passing in the context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

41a5169c 12-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: code clean for completion_lock in io_arm_poll_handler()

We can merge two spin_unlock() operations to one since we removed some
code not long ago.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f552a27a 11-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: remove files pointer in cancellation functions

When doing cancellation, we use a parameter to indicate where it's from
do_exit or exec. So a boolean value is good enough for this, remove the
struct files* as it is not necessary.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

20e60a38 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: skip request refcounting

As submission references are gone, there is only one initial reference
left. Instead of actually doing atomic refcounting, add a flag
indicating whether we're going to take more refs or doing any other sync
magic. The flag should be set before the request may get used in
parallel.

Together with the previous patch it saves 2 refcount atomics per request
for IOPOLL and IRQ completions, and 1 atomic per req for inline
completions, with some exceptions. In particular, currently, there are
three cases, when the refcounting have to be enabled:
- Polling, including apoll. Because double poll entries takes a ref.
Might get relaxed in the near future.
- Link timeouts, enabled for both, the timeout and the request it's
bound to, because they work in-parallel and we need to synchronise
to cancel one of them on completion.
- When a request gets in io-wq, because it doesn't hold uring_lock and
we need guarantees of submission references.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d5901a3 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove submission references

Requests are by default given with two references, submission and
completion. Completion references are straightforward, they represent
request ownership and are put when a request is completed or so.
Submission references are a bit more trickier. They're needed when
io_issue_sqe() followed deep into the submission stack (e.g. in fs,
block, drivers, etc.), request may have given away for concurrent
execution or already completed, and the code unwinding back to
io_issue_sqe() may be accessing some pieces of our requests, e.g.
file or iov.

Now, we prevent such async/in-depth completions by pushing requests
through task_work. Punting to io-wq is also done through task_works,
apart from a couple of cases with a pretty well known context. So,
there're two cases:
1) io_issue_sqe() from the task context and protected by ->uring_lock.
Either requests return back to io_uring or handed to task_work, which
won't be executed because we're currently controlling that task. So,
we can be sure that requests are staying alive all the time and we don't
need submission references to pin them.

2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of
submission reference is played by io-wq reference, which is put by
io_wq_submit_work(). Hence, it should be fine.

Considering that, we can carefully kill the submission reference.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91c2f697 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove req_ref_sub_and_test()

Soon, we won't need to put several references at once, remove
req_ref_sub_and_test() and @nr argument from io_put_req_deferred(),
and put the rest of the references by hand.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

21c843d5 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move req_ref_get() and friends

Move all request refcount helpers to avoid forward declarations in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

79ebeaee 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: remove IRQ aspect of io_ring_ctx completion lock

We have no hard/soft IRQ users of this lock left, remove any IRQ
disabling/saving and restoring when grabbing this lock.

This is straight forward with no users entering with IRQs disabled
anymore, the only thing to look out for is the waitqueue poll head
lock which nests inside the completion lock. That needs IRQs disabled,
and hence we have to do that now instead of relying on the outer lock
doing so.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

8ef12efe 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run regular file completions from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

89b263f6 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run linked timeouts from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

89850fce 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run timeouts from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Add a timeout_lock to handle the ordering of timeout completions or
cancelations with the timeouts actually triggering.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

62906e89 10-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove file batch-get optimisation

For requests with non-fixed files, instead of grabbing just one
reference, we get by the number of left requests, so the following
requests using the same file can take it without atomics.

However, it's not all win. If there is one request in the middle
not using files or having a fixed file, we'll need to put back the left
references. Even worse if an application submits requests dealing with
different files, it will do a put for each new request, so doubling the
number of atomics needed. Also, even if not used, it's still takes some
cycles in the submission path.

If a file used many times, it rather makes sense to pre-register it, if
not, we may fall in the described pitfall. So, this optimisation is a
matter of use case. Go with the simpliest code-wise way, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6294f368 10-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clean up tctx_task_work()

After recent fixes, tctx_task_work() always does proper spinlocking
before looking into ->task_list, so now we don't need atomics for
->task_state, replace it with non-atomic task_running using the critical
section.

Tide it up, combine two separate block with spinlocking, and always try
to splice in there, so we do less locking when new requests are arriving
during the function execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d709043 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline io_poll_remove_waitqs

Inline io_poll_remove_waitqs() into its only user and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90f67366 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove extra argument for overflow flush

Unlike __io_cqring_overflow_flush(), nobody does forced flushing with
io_cqring_overflow_flush(), so removed the argument from it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd0ca2e0 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline struct io_comp_state

Inline struct io_comp_state into struct io_submit_state. They are
already coupled tightly, together with mixed responsibilities it
only brings confusion having them separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bb943b82 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: use inflight_entry instead of compl.list

req->compl.list is used to cache freed requests, and so can't overlap in
time with req->inflight_entry. So, use inflight_entry to link requests
and remove compl.list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7255834e 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove redundant args from cache_free

We don't use @tsk argument of io_req_cache_free(), remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c34b025f 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: cache __io_free_req()'d requests

Don't kfree requests in __io_free_req() but put them back into the
internal request cache. That makes allocations more sustainable and will
be used for refcounting optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f56165e6 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_fallback_req_func()

Move io_fallback_req_func() to kill yet another forward declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e9dbe221 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise putting task struct

We cache all the reference to task + tctx, so if io_put_task() is
called by the corresponding task itself, we can save on atomics and
return the refs right back into the cache.

It's beneficial for all inline completions, and also iopolling, when
polling and submissions are done by the same task, including
SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well,
so those should be flushed in the loop for tctx_inflight() to work
right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

af066f31 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: drop exec checks from io_req_task_submit

In case of on-exec io_uring cancellations, tasks already wait for all
submitted requests to get completed/cancelled, so we don't need to check
for ->in_execve separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bbbca094 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill unused IO_IOPOLL_BATCH

IO_IOPOLL_BATCH is not used, delete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

58d3be2c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: improve ctx hang handling

If io_ring_exit_work() can't get it done in 5 minutes, something is
going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help
and it may take much of CPU time if there is a lot of workers stuck as
such.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3fddf6d 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: deduplicate open iopoll check

Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat
and openat2 reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

543af3a1 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline io_free_req_deferred

Inline io_free_req_deferred(), there is no reason to keep it separated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9bd2bea 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_rsrc_node_alloc() definition

Move the function together with io_rsrc_node_ref_zero() in the source
file as it is to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6a290a14 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_put_task() definition

Move the function in the source file as it is to get rid of forward
declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e73c5c7c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: extract a helper for ctx quiesce

Refactor __io_uring_register() by extracting a helper responsible for
ctx queisce. Looks better and will make it easier to add more
optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90291099 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise io_cqring_wait() hot path

Turns out we always init struct io_wait_queue in io_cqring_wait(), even
if it's not used after, i.e. there are already enough of CQEs. And often
it's exactly what happens, for instance, requests may have been
completed inline, or in case of io_uring_enter(submit=N, wait=1).

It shows up in my profiler, so optimise it by delaying the struct init.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

282cdc86 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add more locking annotations for submit

Add more annotations for submission path functions holding ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a2416e1e 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't halt iopoll too early

IOPOLL users should care more about getting completions for requests
they submitted, but not in "device did/completed something". Currently,
io_do_iopoll() may return a positive number, which will instruct
io_iopoll_check() to break the loop and end the syscall, even if there
is not enough CQEs or none at all.

Don't return positive numbers, so io_iopoll_check() exits only when it
gets an actual error, need reschedule or got enough CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

864ea921 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: refactor io_alloc_req

Replace the main if of io_flush_cached_reqs() with inverted condition +
goto, so all the cases are handled in the same way. And also extract
io_preinit_req() to make it cleaner and easier to refer to.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8724dd8c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io-wq: improve wq_list_add_tail()

Prepare nodes that we're going to add before actually linking them, it's
always safer and costs us nothing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2215bed9 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove unnecessary PF_EXITING check

We prefer nornal task_works even if it would fail requests inside. Kill
a PF_EXITING check in io_req_task_work_add(), task_work_add() handles
well dying tasks, i.e. return error when can't enqueue due to late
stages of do_exit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ebc11b6c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clean io-wq callbacks

Move io-wq callbacks closer to each other, so it's easier to work with
them, and rename io_free_work() into io_wq_free_work() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c97d8a0f 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: avoid touching inode in rw prep

If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set.
However, for non-reg files io_prep_rw() still will look into inode to
double check, and that's expensive and can be avoided.

The only caveat is that it only currently works with 64+ bit
architectures, see FFS_ISREG, so we should consider that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b191e2df 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: rename io_file_supports_async()

io_file_supports_async() checks whether a file supports nowait
operations, so "async" in the name is misleading. Rename it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac177053 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline fixed part of io_file_get()

Optimise io_file_get() with registered files, which is in a hot path,
by inlining parts of the function. Saves a function call, and
inefficiencies of passing arguments, e.g. evaluating
(sqe_flags & IOSQE_FIXED_FILE).

It couldn't have been done before as compilers were refusing to inline
it because of the function size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

042b0d85 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: use kvmalloc for fixed files

Instead of hand-coded two-level tables for registered files, allocate
them with kvmalloc(). In many cases small enough tables are enough, and
so can be kmalloc()'ed removing an extra memory load and a bunch of bit
logic instructions from the hot path. If the table is larger, we trade
off all the pros with a TLB-assisted memory lookup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5fd46178 06-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: be smarter about waking multiple CQ ring waiters

Currently we only wake the first waiter, even if we have enough entries
posted to satisfy multiple waiters. Improve that situation so that
every waiter knows how much the CQ tail has to advance before they can
be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get
4 completions, then we wake up 4 waiters. If we have N waiters asking
for 2 completions and we get 4 completions, then we wake up the first
two. Previously, only the first waiter would've been woken up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3e9f732 04-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: remove GFP_ATOMIC allocation off schedule out path

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [ 90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [ 90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [ 90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G W 5.14.0-rc4-rt4+ #89
| [ 90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [ 90.202561] Call Trace:
| [ 90.202577] dump_stack_lvl+0x34/0x44
| [ 90.202584] ___might_sleep.cold+0x87/0x94
| [ 90.202588] rt_spin_lock+0x19/0x70
| [ 90.202593] ___slab_alloc+0xcb/0x7d0
| [ 90.202598] ? newidle_balance.constprop.0+0xf5/0x3b0
| [ 90.202603] ? dequeue_entity+0xc3/0x290
| [ 90.202605] ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202610] ? pick_next_task_fair+0xb9/0x330
| [ 90.202612] ? __schedule+0x670/0x1410
| [ 90.202615] ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202618] kmem_cache_alloc_trace+0x79/0x1f0
| [ 90.202621] io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202625] io_wq_worker_sleeping+0x37/0x50
| [ 90.202628] schedule+0x30/0xd0
| [ 90.202630] schedule_timeout+0x8f/0x1a0
| [ 90.202634] ? __bpf_trace_tick_stop+0x10/0x10
| [ 90.202637] io_wqe_worker+0xfd/0x320
| [ 90.202641] ? finish_task_switch.isra.0+0xd3/0x290
| [ 90.202644] ? io_worker_handle_work+0x670/0x670
| [ 90.202646] ? io_worker_handle_work+0x670/0x670
| [ 90.202649] ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between to sleep cycles.

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94c821fb 20-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: rebuild nat_bits during umount

If all free_nat_bitmap are available, we can rebuild nat_bits from
free_nat_bitmap entirely during umount, let's make another chance
to reenable nat_bits for image.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

a4b68176 20-Aug-2021 Daeho Jeong <daehojeong@google.com>

f2fs: introduce periodic iostat io latency traces

Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.

So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.

[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

52118743 19-Aug-2021 Daeho Jeong <daehojeong@google.com>

f2fs: separate out iostat feature

Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b661601a 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

lockd: update nlm_lookup_file reexport comment

Update comment to reflect that we *do* allow reexport, whether it's a
good idea or not....

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a81041b7 23-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nlm: minor refactoring

Make this lookup slightly more concise, and prepare for changing how we
look this up in a following patch.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2dc6f19e 22-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nlm: minor nlm_lookup_file argument change

It'll come in handy to get the whole nlm_lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

11e4e66e 23-Aug-2021 aalexandrovich <88376726+aalexandrovich@users.noreply.github.com>

Merge branch 'torvalds:master' into master


0d977e0e 20-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

btrfs: reset replace target device to allocation state on close

This crash was observed with a failed assertion on device close:

BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:

close_ctree():
btrfs_dev_replace_suspend_for_unmount();
btrfs_close_devices():
btrfs_close_fs_devices():
btrfs_close_one_device():
ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
&device->dev_state));

However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.

To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

58bc6d1b 22-Aug-2021 Stian Skjelstad <stian.skjelstad@gmail.com>

udf_get_extendedattr() had no boundary checks.

When parsing the ExtendedAttr data, malicous or corrupt attribute length
could cause kernel hangs and buffer overruns in some special cases.

Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com
Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

939c7feb 11-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix ordered extent boundary calculation

btrfs_lookup_ordered_extent() is supposed to query the offset in a file
instead of the logical address. Pass the file offset from
submit_extent_page() to calc_bio_boundaries().

Also, calc_bio_boundaries() relies on the bio's operation flag, so move
the call site after setting it.

Fixes: 390ed29b817e ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

11462397 11-Aug-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: do not do preemptive flushing if the majority is global rsv

A common characteristic of the bug report where preemptive flushing was
going full tilt was the fact that the vast majority of the free metadata
space was used up by the global reserve. The hard 90% threshold would
cover the majority of these cases, but to be even smarter we should take
into account how much of the outstanding reservations are covered by the
global block reserve. If the global block reserve accounts for the vast
majority of outstanding reservations, skip preemptive flushing, as it
will likely just cause churn and pain.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

93c60b17 11-Aug-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: reduce the preemptive flushing threshold to 90%

The preemptive flushing code was added in order to avoid needing to
synchronously wait for ENOSPC flushing to recover space. Once we're
almost full however we can essentially flush constantly. We were using
98% as a threshold to determine if we were simply full, however in
practice this is a really high bar to hit. For example reports of
systems running into this problem had around 94% usage and thus
continued to flush. Fix this by lowering the threshold to 90%, which is
a more sane value, especially for smaller file systems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
CC: stable@vger.kernel.org # 5.12+
Fixes: 576fa34830af ("btrfs: improve preemptive background space flushing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3736127a 02-Aug-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: tree-log: check btrfs_lookup_data_extent return value

Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.

Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.

Fix the condition by checking if the value was bellow zero and return
early.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8be2ba2e 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: avoid unnecessarily logging directories that had no changes

There are several cases where when logging an inode we need to log its
parent directories or logging subdirectories when logging a directory.

There are cases however where we end up logging a directory even if it was
not changed in the current transaction, no dentries added or removed since
the last transaction. While this is harmless from a functional point of
view, it is a waste time as it brings no advantage.

One example where this is triggered is the following:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt

$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C

$ touch /mnt/A/foo
$ ln /mnt/A/foo /mnt/B/bar
$ ln /mnt/A/foo /mnt/C/baz

$ sync

$ rm -f /mnt/A/foo
$ xfs_io -c "fsync" /mnt/B/bar

This last fsync ends up logging directories A, B and C, however we only
need to log directory A, as B and C were not changed since the last
transaction commit.

So fix this by changing need_log_inode(), to return false in case the
given inode is a directory and has a ->last_trans value smaller than the
current transaction's ID.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5b9b26f5 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mount

Now that we converted btrfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag. We only need to raise this flag on the btrfs_root_fs_type
filesystem since btrfs_mount_root() is ultimately responsible for
allocating the superblock and is called into from btrfs_mount()
associated with btrfs_fs_type.

The conversion of the btrfs inode operations was straightforward.
Regarding btrfs specific ioctls that perform checks based on inode
permissions only those have been allowed that are not filesystem wide
operations and hence can be reasonably charged against a specific mount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4a8b34af 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: handle ACLs on idmapped mounts

Make the ACL code idmapped mount aware. The POSIX default and POSIX
access ACLs are the only ACLs other than some specific xattrs that take
DAC permissions into account. On an idmapped mount they need to be
translated according to the mount's userns. The main change is done to
__btrfs_set_acl() which is responsible for translating POSIX ACLs to
their final on-disk representation.

The btrfs_init_acl() helper does not need to take the idmapped mount
into account since it is called in the context of file creation
operations (mknod, create, mkdir, symlink, tmpfile) and is used for
btrfs_init_inode_security() to copy POSIX default and POSIX access
permissions from the parent directory. These ACLs need to be inherited
unmodified from the parent directory. This is identical to what we do
for ext4 and xfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6623d9a0 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped INO_LOOKUP_USER ioctl

The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
and has the following restrictions. The main difference between the two
is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
is scoped beneath the file descriptor passed with the ioctl.
Specifically, INO_LOOKUP_USER must adhere to the following restrictions:

- The caller must be privileged over each inode of each path component
for the path they are trying to lookup.

- The path for the subvolume the caller is trying to lookup must be reachable
from the inode associated with the file descriptor passed with the ioctl.

The second condition makes it possible to scope the lookup of the path
to the mount identified by the file descriptor passed with the ioctl.
This allows us to enable this ioctl on idmapped mounts.

Specifically, this is possible because all child subvolumes of a parent
subvolume are reachable when the parent subvolume is mounted. So if the
user had access to open the parent subvolume or has been given the fd
then they can lookup the path if they had access to it provided they
were privileged over each path component.

Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
of a subvolume even though they would otherwise be restricted from doing
so via regular VFS-based lookup.

So think about a parent subvolume with multiple child subvolumes.
Someone could mount he parent subvolume and restrict access to the child
subvolumes by overmounting them with empty directories. At this point
the user can't traverse the child subvolumes and they can't open files
in the child subvolumes. However, they can still learn the path of
child subvolumes as long as they have access to the parent subvolume by
using the INO_LOOKUP_USER ioctl.

The underlying assumption here is that it's ok that the lookup ioctls
can't really take mounts into account other than the original mount the
fd belongs to during lookup. Since this assumption is baked into the
original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

39e1674f 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SUBVOL_SETFLAGS ioctl

Setting flags on subvolumes or snapshots are core features of btrfs. The
SUBVOL_SETFLAGS ioctl is especially important as it allows to make
subvolumes and snapshots read-only or read-write. Allow setting flags on
btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e4fed17a 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls

The SET_RECEIVED_SUBVOL ioctls are used to set information about
a received subvolume. Make it possible to set information about a
received subvolume on idmapped mounts. This is a fairly straightforward
operation since all the permission checking helpers are already capable
of handling idmapped mounts. So we just need to pass down the mount's
userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

aabb34e7 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids

So far we prevented the deletion of subvolumes and snapshots using
subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.

This restriction is necessary on idmapped mounts as this allows
filesystem wide subvolume and snapshot deletions and thus can escape the
scope of what's exposed under the mount identified by the fd passed with
the ioctl.

Deletion by subvolume id works by looking for an alias of the parent of
the subvolume or snapshot to be deleted. The parent alias can be
anywhere in the filesystem. However, as long as the alias of the parent
that is found is the same as the one identified by the file descriptor
passed through the ioctl we can allow the deletion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c4ed533b 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SNAP_DESTROY ioctls

Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.

This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.

This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.

The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.

This isn't a huge problem a root user wanting to delete arbitrary
subvolumes can just always create another (even detached) mount without
an idmapping attached.

In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is directly located under inode referenced by the fd
passed for the ioctl() in a follow-up commit.

Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be delete but it
can still be deleted by using the subvolume id:

/* Compile the following program as "delete_by_spec". */

#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
struct btrfs_ioctl_vol_args_v2 args = {};
int ret;

args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
args.subvolid = subvolid;

ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
if (ret < 0)
return -1;

return 0;
}

int main(int argc, char *argv[])
{
int subvolid = 0;

if (argc < 3)
exit(1);

fprintf(stderr, "Opening %s\n", argv[1]);
int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
if (fd < 0)
exit(2);

subvolid = atoi(argv[2]);

fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
int ret = rm_subvolume_by_id(fd, subvolid);
if (ret < 0)
exit(3);

exit(0);
}
#include <stdio.h>"
#include <stdlib.h>"
#include <linux/btrfs.h"

truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}

With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.

The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4d4340c9 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls

Creating subvolumes and snapshots is one of the core features of btrfs
and is even available to unprivileged users. Make it possible to use
subvolume and snapshot creation on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5474bf40 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: check whether fsgid/fsuid are mapped during subvolume creation

When a new subvolume is created btrfs currently doesn't check whether
the fsgid/fsuid of the caller actually have a mapping in the user
namespace attached to the filesystem. The VFS always checks this to make
sure that the caller's fsgid/fsuid can be represented on-disk. This is
most relevant for filesystems that can be mounted inside user namespaces
but it is in general a good hardening measure to prevent unrepresentable
gid/uid from being written to disk.

Since we want to support idmapped mounts for btrfs ioctls to create
subvolumes in follow-up patches this becomes important since we want to
make sure the fsgid/fsuid of the caller as mapped according to the
idmapped mount can be represented on-disk. Simply add the missing
fsuidgid_has_mapping() line from the VFS may_create() version to
btrfs_may_create().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3bc71ba0 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped permission inode op

Enable btrfs_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d4d09464 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped setattr inode op

Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

98b6ab5f 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped tmpfile inode op

Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5a052108 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped symlink inode op

Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b0b3e44d 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mkdir inode op

Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e93ca491 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped create inode op

Enable btrfs_create() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

72105277 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mknod inode op

Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c020d2ea 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped getattr inode op

Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ca07274c 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped rename inode op

Enable btrfs_rename() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b3b6f5b9 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: handle idmaps in btrfs_new_inode()

Extend btrfs_new_inode() to take the idmapped mount into account when
initializing a new inode. This is just a matter of passing down the
mount's userns. The rest is taken care of in inode_init_owner(). This is
a preliminary patch to make the individual btrfs inode operations
idmapped mount aware.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c2fd68b6 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: add mapping aware lookup helper

Various filesystems rely on the lookup_one_len() helper to lookup a
single path component relative to a well-known starting point. Allow
such filesystems to support idmapped mounts by adding a version of this
helper to take the idmap into account when calling inode_permission().
This change is a required to let btrfs (and other filesystems) support
idmapped mounts.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e7849e33 10-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: sysfs: document structures and their associated files

Sysfs file has grown big. It takes some time to locate the correct
struct attribute to add new files. Create a table and map the struct
attribute to its sysfs path.

Also, fix the comment about the debug sysfs path. And add the comments
to the attributes instead of attribute group, where sysfs file names are
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e4571b8c 06-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: fix NULL pointer dereference when deleting device by invalid id

[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:

# mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
/dev/test/scratch2
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs device remove 3 /mnt/btrfs

Then we have the following kernel NULL pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
btrfs_ioctl+0x18bb/0x3190 [btrfs]
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
? do_user_addr_fault+0x201/0x6a0
? lock_release+0xd2/0x2d0
? __x64_sys_ioctl+0x83/0xb0
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

[CAUSE]
Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().

But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.

In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.

Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.

[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.

Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

63fb5879 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: add asserts on splitting extent_map

We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
we can assume the extent_map is PINNED, not LOGGING, and in the modified
list. Add ASSERT()s to ensure the extent_maps after the split also has the
proper flags set and are in the modified list.

Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0ae79c6f 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix block group alloc_offset calculation

alloc_offset is offset from the start of a block group and @offset is
actually an address in logical space. Thus, we need to consider
block_group->start when calculating them.

Fixes: 011b41bffa3d ("btrfs: zoned: advance allocation pointer after tree log node")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ba86dd9f 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: suppress reclaim error message on EAGAIN

btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
running. The message can fail btrfs/187 and it's unnecessary because we
anyway add it back to the reclaim list.

btrfs_reclaim_bgs_work()
`-> btrfs_relocate_chunk()
`-> btrfs_relocate_block_group()
`-> reloc_chunk_start()
`-> if (fs_info->send_in_progress)
`-> return -EAGAIN

CC: stable@vger.kernel.org # 5.13+
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

77233c2d 09-Aug-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: allow disabling of zone auto reclaim

Automatically reclaiming dirty zones might not always be desired for all
workloads, especially as there are currently still some rough edges with
the relocation code on zoned filesystems.

Allow disabling zone auto reclaim on a per filesystem basis by writing 0
as the threshold value.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1f295373 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: update comment at log_conflicting_inodes()

A comment at log_conflicting_inodes() mentions that we check the inode's
logged_trans field instead of using btrfs_inode_in_log() because the field
last_log_commit is not updated when we log that an inode exists and the
inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
about the full sync flag is not true anymore since commit 9acc8103ab594f
("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
update the comment to not mention that part anymore.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d135a533 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove no longer needed full sync flag check at inode_logged()

Now that we are checking if the inode's logged_trans is 0 to detect the
possibility of the inode having been evicted and reloaded, the test for
the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
tree-log.c:inode_logged(). Its purpose was to detect the possibility
of a previous eviction as well, since when an inode is loaded the full
sync flag is always set on it (and only cleared after the inode is
logged).

So just remove the check and update the comment. The check for the inode's
logged_trans being 0 was added recently by the patch with the subject
"btrfs: eliminate some false positives when checking if inode was logged".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1c167b87 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove unnecessary NULL check for the new inode during rename exchange

At the very end of btrfs_rename_exchange(), in case an error happened, we
are checking if 'new_inode' is NULL, but that is not needed since during a
rename exchange, unlike regular renames, 'new_inode' can never be NULL,
and if it were, we would have a crashed much earlier when we dereference it
multiple times.

So remove the check because it is not necessary and because it is causing
static checkers to emit a warning. I probably introduced the check by
copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
NULL, in commit 86e8aa0e772cab ("Btrfs: unpin logs if rename exchange
operation fails").

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dce28150 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate backref_ctx on stack in find_extent_clone

Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
on stack. The size is reasonably small.

sizeof(backref_ctx) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c853a578 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate btrfs_ioctl_defrag_range_args on stack

Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_defrag_range_args) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0afb603a 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate btrfs_ioctl_quota_rescan_args on stack

Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_quota_rescan_args) = 64

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

98caf953 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate file_ra_state on stack in readahead_cache

Instead of allocating file_ra_state using kmalloc, allocate on stack.
sizeof(struct readahead) = 32 bytes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0ff40a91 29-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: introduce btrfs_search_backwards function

It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.

Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.

The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ea3dc7d2 28-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: print if fsverity support is built in when loading module

As fsverity support depends on a config option, print that at module
load time like we do for similar features.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

70524253 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: verity metadata orphan items

Writing out the verity data is too large of an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

14605409 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: initial fsverity support

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

77eea05e 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: add ro compat flags to inodes

Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:

btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted exactly badly).

Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading zeroes
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

efc222f8 27-Jul-2021 Anand Jain <anand.jain@oracle.com>

btrfs: simplify return values in btrfs_check_raid_min_devices

Function btrfs_check_raid_min_devices() returns error code from the enum
btrfs_err_code and it starts from 1. So there is no need to check if ret
is > 0. So drop this check and also drop the local variable ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7361b4ae 28-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove the dead comment in writepage_delalloc()

When btrfs_run_delalloc_range() failed, we will error out.

But there is a strange comment mentioning that
btrfs_run_delalloc_range() could have returned value >0 to indicate the
IO has already started.

Commit 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack
usage") introduced the comment, but unfortunately at that time, we were
already using @page_started to indicate that case, and still return 0.

Furthermore, even if that comment was right (which is not), we would
return -EIO if the IO had already started.

By all means the comment is incorrect, just remove the comment along
with the dead check.

Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
make sure we either return 0 or error, no positive return value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b2f78e88 22-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: allow degenerate raid0/raid10

The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.

The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.

Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.

Example output:

# btrfs fi us -T .
Overall:
Device size: 10.00GiB
Device allocated: 1.01GiB
Device unallocated: 8.99GiB
Device missing: 0.00B
Used: 200.61MiB
Free (estimated): 9.79GiB (min: 9.79GiB)
Free (statfs, df): 9.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no

Data Metadata System
Id Path RAID0 single single Unallocated
-- ---------- --------- --------- -------- -----------
1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB
-- ---------- --------- --------- -------- -----------
Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB
Used 200.25MiB 352.00KiB 16.00KiB

# btrfs dev us .
/dev/sda10, ID: 1
Device size: 10.00GiB
Device slack: 0.00B
Data,RAID0/1: 1.00GiB
Metadata,single: 8.00MiB
System,single: 1.00MiB
Unallocated: 8.99GiB

Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bd54f381 27-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: do not pin logs too early during renames

During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.

In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.

So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.

The following script that uses dbench was used for testing:

$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 120 16

umount $MNT

The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
and using a non-debug kernel config (Debian's default config).

The results compare a branch without this patch and without the previous
patch in the series, that has the subject:

"btrfs: eliminate some false positives when checking if inode was logged"

Versus the same branch with these two patches applied.

dbench with 8 clients, results before:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4391359 0.009 249.745
Close 3225882 0.001 3.243
Rename 185953 0.065 240.643
Unlink 886669 0.049 249.906
Deltree 112 2.455 217.433
Mkdir 56 0.002 0.004
Qpathinfo 3980281 0.004 3.109
Qfileinfo 697579 0.001 0.187
Qfsinfo 729780 0.002 2.424
Sfileinfo 357764 0.004 1.415
Find 1538861 0.016 4.863
WriteX 2189666 0.010 3.327
ReadX 6883443 0.002 0.729
LockX 14298 0.002 0.073
UnlockX 14298 0.001 0.042
Flush 307777 2.447 303.663

Throughput 1149.6 MB/sec 8 clients 8 procs max_latency=303.666 ms

dbench with 8 clients, results after:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4269920 0.009 213.532
Close 3136653 0.001 0.690
Rename 180805 0.082 213.858
Unlink 862189 0.050 172.893
Deltree 112 2.998 218.328
Mkdir 56 0.002 0.003
Qpathinfo 3870158 0.004 5.072
Qfileinfo 678375 0.001 0.194
Qfsinfo 709604 0.002 0.485
Sfileinfo 347850 0.004 1.304
Find 1496310 0.017 5.504
WriteX 2129613 0.010 2.882
ReadX 6693066 0.002 1.517
LockX 13902 0.002 0.075
UnlockX 13902 0.001 0.055
Flush 299276 2.511 220.189

Throughput 1187.33 MB/sec 8 clients 8 procs max_latency=220.194 ms

+3.2% throughput, -31.8% max latency

dbench with 16 clients, results before:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5978334 0.028 156.507
Close 4391598 0.001 1.345
Rename 253136 0.241 155.057
Unlink 1207220 0.182 257.344
Deltree 160 6.123 36.277
Mkdir 80 0.003 0.005
Qpathinfo 5418817 0.012 6.867
Qfileinfo 949929 0.001 0.941
Qfsinfo 993560 0.002 1.386
Sfileinfo 486904 0.004 2.829
Find 2095088 0.059 8.164
WriteX 2982319 0.017 9.029
ReadX 9371484 0.002 4.052
LockX 19470 0.002 0.461
UnlockX 19470 0.001 0.990
Flush 418936 2.740 347.902

Throughput 1495.31 MB/sec 16 clients 16 procs max_latency=347.909 ms

dbench with 16 clients, results after:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5711833 0.029 131.240
Close 4195897 0.001 1.732
Rename 241849 0.204 147.831
Unlink 1153341 0.184 231.322
Deltree 160 6.086 30.198
Mkdir 80 0.003 0.021
Qpathinfo 5177011 0.012 7.150
Qfileinfo 907768 0.001 0.793
Qfsinfo 949205 0.002 1.431
Sfileinfo 465317 0.004 2.454
Find 2001541 0.058 7.819
WriteX 2850661 0.017 9.110
ReadX 8952289 0.002 3.991
LockX 18596 0.002 0.655
UnlockX 18596 0.001 0.179
Flush 400342 2.879 293.607

Throughput 1565.73 MB/sec 16 clients 16 procs max_latency=293.611 ms

+4.6% throughput, -16.9% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6e8e777d 27-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: eliminate some false positives when checking if inode was logged

When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:

1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
unnecessarily join a log transaction and do deletion searches in a log
tree that will not find anything. This just adds unnecessary contention
on extent buffer locks;

2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
needed. If the inode was not logged before, we don't need to log it in
LOG_INODE_EXISTS mode.

So just make sure that any false positive only happens when ->logged_trans
has a value of 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

42b5d73b 21-Jul-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: drop unnecessary ASSERT from btrfs_submit_direct()

When on SINGLE block group, btrfs_get_io_geometry() will return "the
size of the block group - the offset of the logical address within the
block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
INT_MAX);".

The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim()
which both take "int" (now u64 due to the previous patch). So to be
precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
clone_len is already capped by bio.bi_iter.bi_size which is unsigned
int. So the ASSERT is not necessary.

Drop the ASSERT and properly compare submit_len and geom.len in u64.
Then, let the implicit casting to convert it to u64.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

21dda654 21-Jul-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

btrfs: fix argument type of btrfs_bio_clone_partial()

The offset and can never be negative use unsigned int instead of int
type for them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5662c967 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

fs: kill sync_inode

Now that all users of sync_inode() have been deleted, remove
sync_inode().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

25d23cd0 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

9p: migrate from sync_inode to filemap_fdatawrite_wbc

We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b3776305 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking

sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure. If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether. What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e1646070 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: wait on async extents when flushing delalloc

I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem. When we switched to the per-inode in
38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set. This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc. The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all. This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers. This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc. This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

03fe78cc 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc

We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fcdef39c 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: enable a tracepoint when we fail tickets

When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes. This adds the tracpoint
so you can easily get that information.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ac98141d 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: wake up async_delalloc_pages waiters after submit

We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing. The reason for this is we need to see any ordered extents
that were created by our delalloc flushing. However we're waking up
before we do the submit work, which is before we create the ordered
extents. This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely. Fix this by waking us up after we've done the work to
create ordered extents.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

963e4db8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: unify regular and subpage error paths in __extent_writepage()

[BUG]
When running btrfs/160 in a loop for subpage with experimental
compression support, it has a high chance to crash (~20%):

BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:238!
Internal error: Oops - BUG: 0 [#1] SMP
pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
Call trace:
__btrfs_add_ordered_extent+0x550/0x670 [btrfs]
btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
run_delalloc_nocow+0x81c/0x8fc [btrfs]
btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
writepage_delalloc+0xc0/0x1ac [btrfs]
__extent_writepage+0xf4/0x370 [btrfs]
extent_write_cache_pages+0x288/0x4f4 [btrfs]
extent_writepages+0x58/0xe0 [btrfs]
btrfs_writepages+0x1c/0x30 [btrfs]
do_writepages+0x60/0x110
__filemap_fdatawrite_range+0x108/0x170
filemap_fdatawrite_range+0x20/0x30
btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
__btrfs_write_out_cache+0x34c/0x480 [btrfs]
btrfs_write_out_cache+0x144/0x220 [btrfs]
btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
btrfs_sync_fs+0x64/0x1cc [btrfs]
sync_fs_one_sb+0x3c/0x50
iterate_supers+0xcc/0x1d4
ksys_sync+0x6c/0xd0
__arm64_sys_sync+0x1c/0x30
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0x4c/0xd4
do_el0_svc+0x30/0x9c
el0_svc+0x2c/0x54
el0_sync_handler+0x1a8/0x1b0
el0_sync+0x198/0x1c0
---[ end trace 336f67369ae6e0af ]---

[CAUSE]
For subpage case, we can have multiple sectors inside a page, this makes
it possible for __extent_writepage() to have part of its page submitted
before returning.

In btrfs/160, we are using dm-dust to emulate write error, this means
for certain pages, we could have everything running fine, but at the end
of __extent_writepage(), one of the submitted bios fails due to dm-dust.

Then the page is marked Error, and we change @ret from 0 to -EIO.

This makes the caller extent_write_cache_pages() to error out, without
submitting the remaining pages.

Furthermore, since we're erroring out for free space cache, it doesn't
really care about the error and will update the inode and retry the
writeback.

Then we re-run the delalloc range, and will try to insert the same
delalloc range while previous delalloc range is still hanging there,
triggering the above error.

[FIX]
The proper fix is to handle errors from __extent_writepage() properly,
by ending the remaining ordered extent.

But that fix needs the following changes:

- Know at exactly which sector the error happened
Currently __extent_writepage_io() works for the full page, can't
return at which sector we hit the error.

- Grab the ordered extent covering the failed sector

As a hotfix for subpage case, here we unify the error paths in
__extent_writepage().

In fact, the "if (PageError(page))" branch never get executed if @ret is
still 0 for non-subpage cases.

As for non-subpage case, we never submit current page in
__extent_writepage(), but only add current page into bio.
The bio can only get submitted in next page.

Thus we never get PageError() set due to IO failure, thus when we hit
the branch, @ret is never 0.

By simply removing that @ret assignment, we let subpage case ignore the
IO failure, thus only error out for fatal errors just like regular
sectorsize.

So that IO error won't be treated as fatal error not trigger the hanging
OE problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

95ea0486 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: allow read-write for 4K sectorsize on 64K page size systems

Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limitations though:

- For now, subpage RW mount is still considered experimental
Thus that mount warning will still be there.

- No compression support
There are still quite some PAGE_SIZE hard coded and quite some call
sites use extent_clear_unlock_delalloc() to unlock locked_page.
This will screw up subpage helpers.

Now for subpage RW mount, no matter what mount option or inode attr is
set, all writes will not be compressed. Although reading compressed
data has no problem.

- No defrag for subpage case
The defrag support for subpage case will come in later patches, which
will also rework the defrag workflow.

- No inline extent will be created
This is mostly due to the fact that filemap_fdatawrite_range() will
trigger more write than the range specified.
In fallocate calls, this behavior can make us to writeback which can
be inlined, before we enlarge the i_size.

This is a very special corner case, and even current btrfs check won't
report error on such inline extent + regular extent.
But considering how much effort has been put to prevent such inline +
regular, I'd prefer to cut off inline extent completely until we have
a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

9d9ea1e6 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix relocation potentially overwriting last page data

[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

mkfs.btrfs -f -s 4k $dev
mount $dev -o nospace_cache $mnt
$fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
sync
btrfs balance start -d $mnt
btrfs scrub start -B $mnt

Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.

[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
refs 2 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 1339392 count 1
extent data backref root FS_TREE objectid 259 offset 647168 count 1
item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
block group used 102400 chunk_objectid 256 flags DATA
item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

0 32K 64K 96K 100K 104K
|<------ Extent A ----->| |<- Ext B ->|

Then when we first try to relocate extent A, we setup the data reloc
inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update i_size to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [i_size, PAGE_END]
in prealloc_file_extent_cluster().

So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.

Reading such page can cause a deadlock, as we normally expect a dirty
page to be fully uptodate.

Thus here we flush and wait the data reloc inode before doing the hacked
invalidating. This won't cause extra overhead, as we're going to
writeback the data later anyway.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e3c62324 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix false alert when relocating partial preallocated data extents

[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

BTRFS info (device dm-3): balance: start -d
BTRFS info (device dm-3): relocating block group 13631488 flags data
BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

mkfs.btrfs -f -s 4k $dev
mount $dev -o nospace_cache $mnt
xfs_io -f -c "falloc 0 8k" $mnt/file
xfs_io -f -c "pwrite 0 4k" $mnt/file
btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
have EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7c11d0ae 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix a potential use-after-free in writeback helper

[BUG]
There is a possible use-after-free bug when running generic/095.

BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
Faulting instruction address: 0xc000000000283654
c000000000283078 do_raw_spin_unlock+0x88/0x230
c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
c0000000009e0458 end_bio_extent_writepage+0x158/0x270
c000000000b6fd14 bio_endio+0x254/0x270
c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
c000000000b6fd14 bio_endio+0x254/0x270
c000000000b781fc blk_update_request+0x46c/0x670
c000000000b8b394 blk_mq_end_request+0x34/0x1d0
c000000000d82d1c lo_complete_rq+0x11c/0x140
c000000000b880a4 blk_complete_reqs+0x84/0xb0
c0000000012b2ca4 __do_softirq+0x334/0x680
c0000000001dd878 irq_exit+0x148/0x1d0
c000000000016f4c do_IRQ+0x20c/0x240
c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

Thread 1 | Thread 2
--------------------------------+------------------------------------
end_bio_extent_writepage() | btrfs_releasepage()
|- spin_lock_irqsave() | |
|- end_page_writeback() | |
| | |- if (PageWriteback() ||...)
| | |- clear_page_extent_mapped()
| | |- kfree(subpage);
|- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e0467866 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

assertion failed: PagePrivate(page) && page->private
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3403!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
Hardware name: Khadas VIM3 (DT)
Call trace:
assertfail.constprop.0+0x28/0x2c [btrfs]
btrfs_subpage_assert+0x80/0xa0 [btrfs]
btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
btrfs_dirty_pages+0x160/0x270 [btrfs]
btrfs_buffered_write+0x444/0x630 [btrfs]
btrfs_direct_write+0x1cc/0x2d0 [btrfs]
btrfs_file_write_iter+0xc0/0x160 [btrfs]
new_sync_write+0xe8/0x180
vfs_write+0x1b4/0x210
ksys_pwrite64+0x7c/0xc0
__arm64_sys_pwrite64+0x24/0x30
el0_svc_common.constprop.0+0x70/0x140
do_el0_svc+0x28/0x90
el0_svc+0x2c/0x54
el0_sync_handler+0x1a8/0x1ac
el0_sync+0x170/0x180
Code: f0000160 913be042 913c4000 955444bc (d4210000)
---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c8050b3b 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: reject raid56 filesystem and profile conversion

RAID56 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE.

Disable it for subpage support for now.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e0eefe07 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: allow submit_extent_page() to do bio split

Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.

But this behavior can't handle subpage case at all.

For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.

This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.

The proper way to handle it is:

- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}

/* Add as many bytes into the bio */
added = btrfs_bio_add_page();

if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7367253a 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: disable inline extent creation

[BUG]
When running the following fsx command (extracted from generic/127) on
subpage filesystem, it can create inline extent with regular extents:

fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
index 2 namelen 4 name: file
item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
generation 7 type 0 (inline)
inline extent data size 707 ram_bytes 707 compression 0 (none)
item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
generation 7 type 2 (prealloc)
prealloc data disk byte 102346752 nr 4096
prealloc data offset 0 nr 4096

[CAUSE]
For subpage filesystem, the writeback is triggered in page units, which
means, even if we just want to writeback range [16K, 20K) for 64K page
system, we will still try to writeback any dirty sector of range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

9055 trunc from 0x40000 to 0x2c3
9057 falloc from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
writeback any dirty data in [4096, 8192), but nothing else.

Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at
[0, 4096).

And since at the call site, we haven't yet increased i_size, which is
still 707, this means cow_file_range() can insert an inline extent.

Resulting above inline + regular extent.

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
i_size change.

So here I choose to disable inline extent creation for subpage case as a
workaround. We have done tons of work just to avoid such extent, so I
don't to create an exception just for subpage.

This only affects inline extent creation, subpage has no problem reading
existing inline extents at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cc1d0d93 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix writeback which does not have ordered extent

[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30

[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().

Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.

This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.

But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.

This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c2832898 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: make relocate_one_page() handle subpage case

For subpage case, one page of data reloc inode can contain several file
extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch will relocate_one_page() to handle subpage case by:
- Iterating through all extents of a cluster when marking pages
When marking pages dirty and delalloc, we need to check the cluster
extent boundary.
Now we introduce a loop to go extent by extent of a page, until we
either finished the last extent, or reach the page end.

By this, regular sectorsize == PAGE_SIZE can still work as usual, since
we will do that loop only once.

- Iteration start from max(page_start, extent_start)
Since we can have the following case:
| FE B | FE C |<--- File extent D -->|
|<--------- Page --------->|
Thus we can't always start from page_start, but do a
max(page_start, extent_start)

- Iteration end when the cluster is exhausted
Similar to previous case, the last file extent can end before the page
end:
|<--- File extent A --->| FE B | FE C |
|<--------- Page --------->|
In this case, we need to manually exit the loop after we have finished
the last extent of the cluster.

- Reserve metadata space for each extent range
Since now we can hit multiple ranges in one page, we should reserve
metadata for each range, not simply PAGE_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f47960f4 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: reloc: factor out relocation page read and dirty part

In function relocate_file_extent_cluster(), we have a big loop for
marking all involved page delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a6e66e6f 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: rework lzo_decompress_bio() to make it subpage compatible

For the initial subpage support, although we won't support compressed
write, we still need to support compressed read.

But for lzo_decompress_bio() it has several problems:

- The abuse of PAGE_SIZE for boundary detection
For subpage case, we should follow sectorsize to detect the padding
zeros.
Using PAGE_SIZE will cause subpage compress read to skip certain
bytes, and causing read error.

- Too many helper variables
There are half a dozen helper variables, which is only making things
harder to read

This patch will rework lzo_decompress_bio() to make it work for subpage:

- Use sectorsize to do boundary check, while still use PAGE_SIZE for
page switching
This allows us to have the same on-disk format for 4K sectorsize fs,
while take advantage of larger page size.

- Use two main cursors
Only @cur_in and @cur_out is utilized as the main cursor.
The helper variables will only be declared inside the loop, and only 2
helper variables needed.

- Introduce a helper function to copy compressed segment payload
Introduce a new helper, copy_compressed_segment(), to copy a
compressed segment to workspace buffer.
This function will handle the page switching.

Now the net result is, with all the excessive comments and new helper
function, the refactored code is still smaller, and easier to read.

For other decompression code, they have no special padding rule, thus no
need to bother for initial subpage support, but will be refactored to
the same style later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1c3dc173 04-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: rework btrfs_decompress_buf2page()

There are several bugs inside the function btrfs_decompress_buf2page()

- @start_byte doesn't take bvec.bv_offset into consideration
Thus it can't handle case where the target range is not page aligned.

- Too many helper variables
There are tons of helper variables, @buf_offset, @current_buf_start,
@start_byte, @prev_start_byte, @working_bytes, @bytes.
This hurts anyone who wants to read the function.

- No obvious main cursor for the iteartion
A new problem caused by previous problem.

- Comments for parameter list makes no sense
Like @buf_start is the offset to @buf, or offset inside the full
decompressed extent? (Spoiler alert, the later case)
And @total_out acts more like @buf_start + @size_of_buf.

The worst is @disk_start.
The real meaning of it is the file offset of the full decompressed
extent.

This patch will rework the whole function by:

- Add a proper comment with ASCII art to explain the parameter list

- Rework parameter list
The old @buf_start is renamed to @decompressed, to show how many bytes
are already decompressed inside the full decompressed extent.
The old @total_out is replaced by @buf_len, which is the decompressed
data size.
For old @disk_start and @bio, just pass @compressed_bio in.

- Use single main cursor
The main cursor will be @cur_file_offset, to show what's the current
file offset.
Other helper variables will be declared inside the main loop, and only
minimal amount of helper variables:
* offset_inside_decompressed_buf: The only real helper
* copy_start_file_offset: File offset we start memcpy
* bvec_file_offset: File offset of current bvec

Even with all these extensive comments, the final function is still
smaller than the original function, which is definitely a win.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

557023ea 04-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: grab correct extent map for subpage compressed extent read

[BUG]
When subpage compressed read write support is enabled, btrfs/038 always
fails with EIO.

A simplified script can easily trigger the problem:

mkfs.btrfs -f -s 4k $dev
mount $dev $mnt -o compress=lzo

xfs_io -f -c "truncate 118811" $mnt/foo
xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null

sync
btrfs subvolume snapshot -r $mnt $mnt/mysnap1

xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
sync

xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null

sync
btrfs subvolume snapshot -r $mnt $mnt/mysnap2

cat $mnt/mysnap2/foo
# Above cat will fail due to EIO

[CAUSE]
The problem is in btrfs_submit_compressed_read().

When it tries to grab the extent map of the read range, it uses the
following call:

em = lookup_extent_mapping(em_tree,
page_offset(bio_first_page_all(bio)),
fs_info->sectorsize);

The problem is in the page_offset(bio_first_page_all(bio)) part.

The offending inode has the following file extent layout

item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13680640 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 0 nr 0
item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13676544 nr 4096
extent data offset 0 nr 53248 ram 86016
extent compression 2 (lzo)

And the bio passed in has the following parameters:

page_offset(bio_first_page_all(bio)) = 131072
bio_first_bvec_all(bio)->bv_offset = 65536

If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
we will get an extent map for file offset 131072, not 196608.

This means we read uncompressed data from disk, and later decompression
will definitely fail.

[FIX]
Take bv_offset into consideration when trying to grab an extent map.

And add an ASSERT() to ensure we're really getting a compressed extent.

Thankfully this won't affect anything but subpage, thus we only need to
ensure this patch get merged before we enabled basic subpage support.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ca62e85d 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: disable compressed readahead for subpage

For current subpage support, we only support 64K page size with 4K
sector size.

This makes compressed readahead less effective, as maximum compressed
extent size is only 128K, 2x the page size.

On the other hand, the function add_ra_bio_pages() is still assuming
sectorsize == PAGE_SIZE, and code change may affect 4K page size
systems.

So for now, let's disable subpage compressed readahead for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3670e645 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: check if there are compressed extents inside one page

[BUG]
When testing experimental subpage compressed write support, it hits a
NULL pointer dereference inside read path:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
pc : __pi_memcmp+0x28/0x1ec
lr : check_data_csum+0xd0/0x274 [btrfs]
Call trace:
__pi_memcmp+0x28/0x1ec
btrfs_verify_data_csum+0xf4/0x244 [btrfs]
end_bio_extent_readpage+0x1d0/0x6b0 [btrfs]
bio_endio+0x15c/0x1dc
end_workqueue_fn+0x44/0x64 [btrfs]
btrfs_work_helper+0x74/0x250 [btrfs]
process_one_work+0x1d4/0x47c
worker_thread+0x180/0x400
kthread+0x11c/0x120
ret_from_fork+0x10/0x30
Code: 54000261 d100044c d343fd8c f8408403 (f8408424)
---[ end trace 9e2c59f33ea40866 ]---

[CAUSE]
When reading two compressed extents inside the same page, like the
following layout, we trigger above crash:

0 32K 64K
|-------|\\\\\\\|
| \- Compressed extent (A)
\--------- Compressed extent (B)

For compressed read, we don't need to populate its io_bio->csum, as we
rely on compressed_bio->csum to verify the compressed data, and then
copy the decompressed to inode pages.

Normally btrfs_verify_data_csum() skip such page by checking and
clearing its PageChecked flag

But since that flag is still for the full page, when endio for inode
page range [0, 32K) gets executed, it clears PageChecked flag for the
full page.

Then when endio for inode page range [32K, 64K) gets executed, since the
page no longer has PageChecked flag, it just continues checking, even
though io_bio->csum is NULL.

[FIX]
Thankfully there are only two users of PageChecked bit:

- Cow fixup
Since subpage has its own way to trace page dirty (dirty_bitmap) and
ordered bit (ordered_bitmap), it should never trigger cow fixup.

- Compressed read
We can distinguish such read by just checking io_bio->csum.

So just check io_bio->csum before doing the verification to avoid such
NULL pointer dereference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4c37a793 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: reset this_bio_flag to avoid inheriting old flags

In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
compressed extent.

This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
sector for one page, thus @this_bio_flag will only be set at most once.

But for subpage case, after hitting a compressed extent, @this_bio_flag
will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
extent.

This will lead to various read errors, and causing new ASSERT() in
incoming subpage patches, which adds more strict check in
btrfs_submit_compressed_read().

Fix it by declaring @this_bio_flag inside the main loop and reset its
value for each iteration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

214cc184 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: constify and cleanup variables in comparators

Comparators just read the data and thus get const parameters. This
should be also preserved by the local variables, update all comparators
passed to sort or bsearch.

Cleanups:

- unnecessary casts are dropped
- btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
and 'inline' is dropped as the function address is taken

Signed-off-by: David Sterba <dsterba@suse.com>

d58ede8d 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: simplify data stripe calculation helpers

There are two helpers doing the same calculations based on nparity and
ncopies. calc_data_stripes can be simplified into one expression, so far
we don't have profile with both copies and parity, so there's no
effective change. calc_stripe_length should reuse the helper and not
repeat the same calculation.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fe4f46d4 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: merge alloc_device helpers

The device allocation is split to two functions, but one just calls the
other and they're very far in the file. Merge them together.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

500a44c9 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: uninline btrfs_bg_flags_to_raid_index

The helper does a simple translation from block group flags to index to
the btrfs_raid_array table. There's no apparent reason to inline the
function, the translation happens usually once per function and is not
called in a loop.

Making it a proper function saves quite some binary code (x86_64,
release config):

text data bss dec hex filename
1164011 19253 14912 1198176 124860 pre/btrfs.ko
1161559 19253 14912 1195724 123ecc post/btrfs.ko

DELTA: -2451

Also add the const attribute as there are no side effects, this could
help compiler to optimize a few things without the function body.

Signed-off-by: David Sterba <dsterba@suse.com>

6c154ba4 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: tree-checker: add missing stripe checks for raid1c3/4 profiles

The stripe checks for raid1c3/raid1c4 are missing in the sequence in
btrfs_check_chunk_valid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0ac6e06b 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: tree-checker: use table values for stripe checks

There are hardcoded values in several checks regarding chunks and stripe
constraints. We have that defined in the raid table and ought to use it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

809d6902 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: make btrfs_next_leaf static inline

btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
to header to avoid the function call overhead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f41b6ba9 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending

In commit e65f152e4348 ("btrfs: refactor how we finish ordered extent io
for endio functions") there was last caller not using 1 for the uptodate
parameter. Now there's only one, passing 1, so we can remove it and
simplify the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

25c1252a 26-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered

The uptodate parameter should be bool, change the type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a129ffb8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()

Since commit d75855b4518b ("btrfs: Remove
extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
and adds btrfs_writepage_cow_fixup() function, there is no need to
follow the old hook parameters.

Remove the @start and @end hook, since currently the fixup check is full
page check, it doesn't need @start and @end hook.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a7d1c5dc 26-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: introduce btrfs_lookup_match_dir

btrfs_search_slot is called in multiple places in dir-item.c to search
for a dir entry, and then calling btrfs_match_dir_name to return a
btrfs_dir_item.

In order to reduce the number of callers of btrfs_search_slot, create a
common function that looks for the dir key, and if found call
btrfs_match_dir_item_name.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f8ee80de 26-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: remove unneeded return variable in btrfs_lookup_file_extent

We can return from btrfs_search_slot directly which also shows that it
follows the same return value convention.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ad9a9378 13-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems

After calling btrfs_search_slot is a common practice to check if the
slot found isn't bigger than number of slots in the current leaf, and if
so, search for the same key in the next leaf by calling btrfs_next_leaf,
which calls btrfs_next_old_leaf to do the job.

Calling btrfs_next_item in the same situation would end up in the same
code flow, since

* btrfs_next_item
* btrfs_next_old_item
* if slot >= nritems(curr_leaf)
btrfs_next_old_leaf

Change btrfs_verify_dev_extents and calculate_emulated_zone_size
functions to use btrfs_next_leaf in the same situation.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c7bcbb21 22-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove ignore_offset argument from btrfs_find_all_roots()

Currently all the callers of btrfs_find_all_roots() pass a value of false
for its ignore_offset argument. This makes the argument pointless and we
can remove it and make btrfs_find_all_roots() always pass false as the
ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2ac691d8 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: avoid unnecessary lock and leaf splits when updating inode in the log

During a fast fsync, if we have already fsynced the file before and in the
current transaction, we can make the inode item update more efficient and
avoid acquiring a write lock on the leaf's parent.

To update the inode item we are always using btrfs_insert_empty_item() to
get a path pointing to the inode item, which calls btrfs_search_slot()
with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) +
sizeof(struct btrfs_item)', and that always results in the search taking
a write lock on the level 1 node that is the parent of the leaf that
contains the inode item. This adds unnecessary lock contention on log
trees when we have multiple fsyncs in parallel against inodes in the same
subvolume, which has a very significant impact due to the fact that log
trees are short lived and their height very rarely goes beyond level 2.

Also, by using btrfs_insert_empty_item() when we need to update the inode
item, we also end up splitting the leaf of the existing inode item when
the leaf has an amount of free space smaller than the size of an inode
item.

Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument,
when we know the inode item already exists in the log. This avoids these
two inefficiencies.

The following script, using fio, was used to perform the tests:

$ cat fio-test.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"

if [ $# -ne 4 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
exit 1
fi

NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4

cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
echo "mount options: $MOUNT_OPTIONS"
echo

umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

fio /tmp/fio-job.ini
umount $MNT

The tests were done on a physical machine, with 12 cores, 64G of RAM,
using a NVMEe device and using a non-debug kernel config (the default one
from Debian). The summary line from fio is provided below for each test
run.

With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:

Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec
After: WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec

+1.4% throughput, -1.2% runtime

With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:

Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec
After: WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec

+2.2% throughput, -2.1% runtime

The changes are small but it's possible to be better on faster hardware as
in the test machine used disk utilization was pretty much 100% during the
whole time the tests were running (observed with 'iostat -xz 1').

The tests also included the previous patch with the subject of:
"btrfs: avoid unnecessary log mutex contention when syncing log".
So they compared a branch without that patch and without this patch versus
a branch with these two patches applied.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e68107e5 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove unnecessary list head initialization when syncing log

One of the last steps of syncing the log is to remove all log contexts
from the root's list of contexts, done at btrfs_remove_all_log_ctxs().
There we iterate over all the contexts in the list and delete each one
from the list, and after that we call INIT_LIST_HEAD() on the list. That
is unnecessary since at that point the list is empty.

So just remove the INIT_LIST_HEAD() call. It's not needed, increases code
size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after
this change) and increases two critical sections delimited by log mutexes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e1a6d264 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: avoid unnecessary log mutex contention when syncing log

When syncing the log we acquire the root's log mutex just to update the
root's last_log_commit. This is unnecessary because:

1) At this point there can only be one task updating this value, which is
the task committing the current log transaction. Any task that enters
btrfs_sync_log() has to wait for the previous log transaction to commit
and wait for the current log transaction to commit if someone else
already started it (in this case it never reaches to the point of
updating last_log_commit, as that is done by the committing task);

2) All readers of the root's last_log_commit don't acquire the root's
log mutex. This is to avoid blocking the readers, potentially for too
long and because getting a stale value of last_log_commit does not
cause any functional problem, in the worst case getting a stale value
results in logging an inode unnecessarily. Plus it's actually very
rare to get a stale value that results in unnecessarily logging the
inode.

So in order to avoid unnecessary contention on the root's log mutex,
which is used for several different purposes, like starting/joining a
log transaction and starting writeback of a log transaction, stop
acquiring the log mutex for updating the root's last_log_commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cceaa89f 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove racy and unnecessary inode transaction update when using no-holes

When using the NO_HOLES feature and expanding the size of an inode, we
update the inode's last_trans, last_sub_trans and last_log_commit fields
at maybe_insert_hole() so that a fsync does know that the inode needs to
be logged (by making sure that btrfs_inode_in_log() returns false). This
happens for expanding truncate operations, buffered writes, direct IO
writes and when cloning extents to an offset greater than the inode's
i_size.

However the way we do it is racy, because in between setting the inode's
last_sub_trans and last_log_commit fields, the log transaction ID that was
assigned to last_sub_trans might be committed before we read the root's
last_log_commit and assign that value to last_log_commit. If that happens
it would make a future call to btrfs_inode_in_log() return true. This is
a race that should be extremely unlikely to be hit in practice, and it is
the same that was described by commit bc0939fcfab0d7 ("btrfs: fix race
between marking inode needs to be logged and log syncing").

The fix would simply be to set last_log_commit to the value we assigned
to last_sub_trans minus 1, like it was done in that commit. However
updating these two fields plus the last_trans field is pointless here
because all the callers of btrfs_cont_expand() (which is the only
caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
next fsync will log the inode, as it makes btrfs_inode_in_log() return
false.

So just remove the code that explicitly sets the inode's last_trans,
last_sub_trans and last_log_commit fields.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5a656c36 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool

In commit 351cbf6e4410e7 ("btrfs: use nofs allocations for running delayed
items") we wrapped all btree updates when running delayed items with
memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
detected by lockdep involving reclaim and the mutex of delayed nodes.

The problem is because the ref verify tool does some memory allocations
with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
eviction, which requires locking the mutex of an inode's delayed node.
On the other hand the ref verify tool is called when allocating metadata
extents as part of operations that modify a btree, which is a problem when
running delayed nodes, where we do btree updates while holding the mutex
of a delayed node. This is what caused the lockdep warning.

Instead of wrapping every btree update when running delayed nodes, change
the ref verify tool to never do GFP_KERNEL allocations, because:

1) We get less repeated code, which at the moment does not even have a
comment mentioning why we need to setup the NOFS context, which is a
recommended good practice as mentioned at
Documentation/core-api/gfp_mask-from-fs-io.rst

2) The ref verify tool is something meant only for debugging and not
something that should be enabled on non-debug / non-development
kernels;

3) We may have yet more places outside delayed-inode.c where we have
similar problem: doing btree updates while holding some lock and
then having the GFP_KERNEL memory allocations, from the ref verify
tool, trigger reclaim and trying again to acquire the same lock
through the reclaim path.
Or we could get more such cases in the future, therefore this change
prevents getting into similar cases when using the ref verify tool.

Curiously most of the memory allocations done by the ref verify tool
were already using GFP_NOFS, except a few ones for no apparent reason.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

506650dc 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: improve the batch insertion of delayed items

When we insert the delayed items of an inode, which corresponds to the
directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
do the following:

1) Pick the first delayed item from the rbtree and insert it into the
fs/subvolume btree, using btrfs_insert_empty_item() for that;

2) Without releasing the path returned by btrfs_insert_empty_item(),
keep collecting as many consecutive delayed items from the rbtree
as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
immediate successor of the previously picked item and as long as
they fit in the available space of the leaf the path points to;

3) Then insert all the collected items into the leaf;

4) Release the reserve metadata space for each collected item and
release each item (implies deleting from the rbtree);

5) Unlock the path.

While this is much better than inserting items one by one, it can be
improved in a few aspects:

1) Instead of adding items based on the remaining free space of the
leaf, collect as many items that can fit in a leaf and bulk insert
them. This results in less and larger batches, reducing the total
amount of time to insert the delayed items. For example when adding
100K files to a directory, we ended up creating 1658 batches with
very variable sizes ranging from 1 item to 118 items, on a filesystem
with a node/leaf size of 16K. After this change, we end up with 839
batches, with the vast majority of them having exactly 120 items;

2) We do the search for more items to batch, by iterating the rbtree,
while holding a write lock on the leaf;

3) While still holding the leaf locked, we are releasing the reserved
metadata for each item and then deleting each item, keeping a write
lock on the leaf for longer than necessary. Releasing the delayed items
one by one can take a significant amount of time, because deleting
them from the rbtree can often be a bit slow when the deletion results
in rebalancing the rbtree.

So change this so that we try to create larger batches, with a total
item size up to the maximum a leaf can support, and by unlocking the leaf
immediately after inserting the items, releasing the reserved metadata
space of each item and releasing each item without holding the write lock
on the leaf.

The following script that runs fs_mark was used to test this change:

$ cat test.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
FILES=1000000
THREADS=16
FILE_SIZE=0

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done

fs_mark $OPTS

umount $MNT

It was run on machine with 12 cores, 64G of ram, using a NVMe device and
using a non-debug kernel config (Debian's default config).

Results before this change:

FSUse% Count Size Files/sec App Overhead
1 16000000 0 76182.1 72223046
3 32000000 0 62746.9 80776528
5 48000000 0 77029.0 93022381
6 64000000 0 73691.6 95251075
8 80000000 0 66288.0 85089634

Results after this change:

FSUse% Count Size Files/sec App Overhead
1 16000000 0 79049.5 (+3.7%) 69700824
3 32000000 0 65248.9 (+3.9%) 80583693
5 48000000 0 77991.4 (+1.2%) 90040908
6 64000000 0 75096.8 (+1.9%) 89862241
8 80000000 0 66926.8 (+1.0%) 84429169

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2b29726c 18-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items

When extent tree gets corrupted, normally it's not extent tree root, but
one toasted tree leaf/node.

In that case, rescue=ibadroots mount option won't help as it can only
handle the extent tree root corruption.

This patch will enhance the behavior by:

- Allow fill_dummy_bgs() to ignore -EEXIST error

This means we may have some block group items read from disk, but
then hit some error halfway.

- Fallback to fill_dummy_bgs() if any error gets hit in
btrfs_read_block_groups()

Of course, this still needs rescue=ibadroots mount option.

With that, rescue=ibadroots can handle extent tree corruption more
gracefully and allow a better recover chance.

Reported-by: Zhenyu Wu <wuzy001@gmail.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html
Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6534c0c9 20-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: pass NULL as trans to btrfs_search_slot if we only want to search

Using a transaction in btrfs_search_slot is only useful when we are
searching to add or modify the tree. When the function is used for
searching, insert length and mod arguments are 0, there is no need to
use a transaction.

No functional changes, changing for consistency.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

069a2e37 20-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: continue readahead of siblings even if target node is in memory

At reada_for_search(), when attempting to readahead a node or leaf's
siblings, we skip the readahead of the siblings if the node/leaf is
already in memory. That is probably fine for the READA_FORWARD and
READA_BACK readahead types, as they are used on contexts where we
end up reading some consecutive leaves, but usually not the whole btree.

However for a READA_FORWARD_ALWAYS mode, currently only used for full
send operations, it does not make sense to skip the readahead if the
target node or leaf is already loaded in memory, since we know the caller
is visiting every node and leaf of the btree in ascending order.

So change the behaviour to not skip the readahead when the target node is
already in memory and the readahead mode is READA_FORWARD_ALWAYS.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk, with 32GiB of RAM and
using a non-debug kernel config (Debian's default config).

$ cat test.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit

mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT

# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))

echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done

wait ${worker_pids[@]}

sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}

file_count=2000000
add_files $file_count 0 4

echo
echo "Creating snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1

umount $MNT

echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null

mount $MOUNT_OPTIONS $DEV $MNT

echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"

umount $MNT

The duration of the full send operations, in seconds, were the following:

Before this change: 85 seconds
After this change: 76 seconds (-11.2%)

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5da38479 15-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: check-integrity: drop kmap/kunmap for block pages

The pages in block_ctx have never been allocated from highmem (in
btrfsic_read_block) so the mapping is pointless and can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>

4c2bf276 15-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: compression: drop kmap/kunmap from generic helpers

The pages in compressed_pages are not from highmem anymore so we can
drop the mapping for checksum calculation and inline extent.

Signed-off-by: David Sterba <dsterba@suse.com>

bbaf9715 14-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: compression: drop kmap/kunmap from zstd

As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>

696ab562 14-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: compression: drop kmap/kunmap from zlib

As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>

8c945d32 14-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: compression: drop kmap/kunmap from lzo

As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>

b0ee5e1e 14-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: drop from __GFP_HIGHMEM all allocations

The highmem flag is used for allocating pages for compression and for
raid56 pages. The high memory makes sense on 32bit systems but is not
without problems. On 64bit system's it's just another layer of wrappers.

The time the pages are allocated for compression or raid56 is relatively
short (about a transaction commit), so the pages are not blocked
indefinitely. As the number of pages depends on the amount of data being
written/read, there's a theoretical problem. A fast device on a 32bit
system could use most of the low memory pool, while with the highmem
allocation that would not happen. This was possibly the original idea
long time ago, but nowadays we optimize for 64bit systems.

This patch removes all usage of the __GFP_HIGHMEM flag for page
allocation, the kmap/kunmap are still in place and will be removed in
followup patches. Remaining is masking out the bit in
alloc_extent_state and __lookup_free_space_inode, that can safely stay.

Signed-off-by: David Sterba <dsterba@suse.com>

23608d51 04-Jul-2021 Anand Jain <anand.jain@oracle.com>

btrfs: cleanup fs_devices pointer usage in btrfs_trim_fs

Drop variable 'devices' (used only once) and add new variable for
the fs_devices, so it is used at two locations within btrfs_trim_fs()
function and also helps to access fs_devices->devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

67d5e289 06-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: remove max argument from generic_bin_search

Both callers use btrfs_header_nritems to feed the max argument. Remove
the argument and let generic_bin_search call it itself.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2eadb9e7 04-Jul-2021 Nikolay Borisov <nborisov@suse.com>

btrfs: make btrfs_finish_chunk_alloc private to block-group.c

One of the final things that must be done to add a new chunk is
inserting its device extent items in the device tree. They describe
the portion of allocated device physical space during phase 1 of
chunk allocation. This is currently done in btrfs_finish_chunk_alloc
whose name isn't very informative. What's more, this function is only
used in block-group.c but is defined as public. There isn't anything
special about it that would warrant it being defined in volumes.c.

Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to
block-group.c, make the former static and rename both functions to
insert_dev_extents and insert_dev_extent respectively.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4a9531cf 05-Jul-2021 Anand Jain <anand.jain@oracle.com>

btrfs: check-integrity: drop unnecessary function prototypes

The function prototypes below aren't necessary as the functions are
first defined before called. Remove them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b3b7e1d0 23-Jun-2021 David Sterba <dsterba@suse.com>

btrfs: add special case to setget helpers for 64k pages

On 64K pages the size of the extent_buffer::pages array is 1 and
compilation with -Warray-bounds warns due to

kaddr = page_address(eb->pages[idx + 1]);

when reading byte range crossing page boundary.

This does never actually overflow the array because on 64K because all
the data fit in one page and bounds are checked by check_setget_bounds.

To fix the reported overflows and warnings add a compile-time condition
that will allow compiler to eliminate the dead code that reads from the
idx + 1 page.

Link: https://lore.kernel.org/lkml/20210623083901.1d49d19d@canb.auug.org.au/
CC: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>

5a80d1c6 28-Jun-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: remove max_zone_append_size logic

There used to be a patch in the original series for zoned support which
limited the extent size to max_zone_append_size, but this patch has been
dropped somewhere around v9.

We've decided to go the opposite direction, instead of limiting extents
in the first place we split them before submission to comply with the
device's limits.

Remove the related code, btrfs_fs_info::max_zone_append_size and
btrfs_zoned_device_info::max_zone_append_size.

This also removes the workaround for dm-crypt introduced in
1d68128c107a ("btrfs: zoned: fail mount if the device does not support
zone append") because the fix has been merged as f34ee1dce642 ("dm
crypt: Fix zoned block device support").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f7e33bdb 19-Aug-2021 Jeff Layton <jlayton@kernel.org>

fs: remove mandatory file locking support

We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>

6e7c1770 14-Jul-2021 Christoph Hellwig <hch@lst.de>

fs: simplify get_filesystem_list / get_all_fs_names

Just output the '\0' separate list of supported file systems for block
devices directly rather than going through a pointless round of string
manipulation.

Based on an earlier patch from Al Viro <viro@zeniv.linux.org.uk>.

Vivek:
Modified list_bdev_fs_names() and split_fs_names() to return number of
null terminted strings to caller. Callers now use that information to
loop through all the strings instead of relying on one extra null char
being present at the end.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2f488f69 02-Jul-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

fcntl: fix potential deadlock for &fasync_struct.fa_lock

There is an existing lock hierarchy of
&dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock
from the following call chain:

input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);

&dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled
while grabbing &fasync_struct.fa_lock, otherwise we invert the lock
hierarchy. However, since kill_fasync which calls kill_fasync_rcu is
an exported symbol, it may not necessarily be called with interrupts
disabled.

As kill_fasync_rcu may be called with interrupts disabled (for
example, in the call chain above), we replace calls to
read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu
with read_lock_irqsave/read_unlock_irqrestore.

Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

f671a691 02-Jul-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

fcntl: fix potential deadlocks for &fown_struct.lock

Syzbot reports a potential deadlock in do_fcntl:

========================================================
WARNING: possible irq lock inversion dependency detected
5.12.0-syzkaller #0 Not tainted
--------------------------------------------------------
syz-executor132/8391 just changed the state of lock:
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: f_getown_ex fs/fcntl.c:211 [inline]
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: do_fcntl+0x8b4/0x1200 fs/fcntl.c:395
but this lock was taken by another, HARDIRQ-safe lock in the past:
(&dev->event_lock){-...}-{2:2}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
Chain exists of:
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&f->f_owner.lock);
local_irq_disable();
lock(&dev->event_lock);
lock(&new->fa_lock);
<Interrupt>
lock(&dev->event_lock);

*** DEADLOCK ***

This happens because there is a lock hierarchy of
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
from the following call chain:

input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);

However, since &dev->event_lock is HARDIRQ-safe, interrupts have to be
disabled while grabbing &f->f_owner.lock, otherwise we invert the lock
hierarchy.

Hence, we replace calls to read_lock/read_unlock on &f->f_owner.lock,
with read_lock_irq/read_unlock_irq.

Reported-and-tested-by: syzbot+e6d5398a02c516ce5e70@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

15517c72 21-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull mandatory file locking deprecation warning from Jeff Layton:
"As discussed on the list, this patch just adds a new warning for folks
who still have mandatory locking enabled and actually mount with '-o
mand'. I'd like to get this in for v5.14 so we can push this out into
stable kernels and hopefully reach folks who have mounts with -o mand.

For now, I'm operating under the assumption that we'll fully remove
this support in v5.15, but we can move that out if any legitimate
users of this facility speak up between now and then"

* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: warn about impending deprecation of mandatory locks


7de875b2 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

lockd: lockd server-side shouldn't set fl_ops

Locks have two sets of op arrays, fl_lmops for the lock manager (lockd
or nfsd), fl_ops for the filesystem. The server-side lockd code has
been setting its own fl_ops, which leads to confusion (and crashes) in
the reexport case, where the filesystem expects to be the only one
setting fl_ops.

And there's no reason for it that I can see-the lm_get/put_owner ops do
the same job.

Reported-by: Daire Byrne <daire@dneg.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

1e6907d58 21-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"A few small fixes that should go into this release:

- Fix never re-assigning an initial error value for io_uring_enter()
for SQPOLL, if asked to do nothing

- Fix xa_alloc_cycle() return value checking, for cases where we have
wrapped around

- Fix for a ctx pin issue introduced in this cycle (Pavel)"

* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
io_uring: fix xa_alloc_cycle() error return value check
io_uring: pin ctx on fallback execution
io_uring: only assign io_uring_enter() SQPOLL error in actual error case


e70e392f 21-Aug-2021 Namjae Jeon <namjae.jeon@samsung.com>

ksmbd: fix permission check issue on chown and chmod

When commanding chmod and chown on cifs&ksmbd, ksmbd allows it without file
permissions check. There is code to check it in settattr_prepare.
Instead of setting the inode directly, update the mode and uid/gid
through notify_change.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

fdd92b64 20-Aug-2021 Jeff Layton <jlayton@kernel.org>

fs: warn about impending deprecation of mandatory locks

We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>

a30f895a 20-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: fix xa_alloc_cycle() error return value check

We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.

Cc: stable@vger.kernel.org
Fixes: 61cf93700fe6 ("io_uring: Convert personality_idr to XArray")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

61e0d0cc 19-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: fix perag structure refcounting error when scrub fails

The kernel test robot found the following bug when running xfs/355 to
scrub a bmap btree:

XFS: Assertion failed: !sa->pag, file: fs/xfs/scrub/common.c, line: 412
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 1415 Comm: xfs_scrub Not tainted 5.14.0-rc4-00021-g48c6615cc557 #1
Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
RIP: 0010:assfail+0x23/0x28 [xfs]
RSP: 0018:ffffc9000aacb890 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc9000aacbcc8 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc09e7dcd
RBP: ffffc9000aacbc80 R08: ffff8881fdf17d50 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
R13: ffff88820c7ed000 R14: 0000000000000001 R15: ffffc9000aacb980
FS: 00007f185b955700(0000) GS:ffff8881fdf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7f6ef43000 CR3: 000000020de38002 CR4: 00000000001706e0
Call Trace:
xchk_ag_read_headers+0xda/0x100 [xfs]
xchk_ag_init+0x15/0x40 [xfs]
xchk_btree_check_block_owner+0x76/0x180 [xfs]
xchk_btree_get_block+0xd0/0x140 [xfs]
xchk_btree+0x32e/0x440 [xfs]
xchk_bmap_btree+0xd4/0x140 [xfs]
xchk_bmap+0x1eb/0x3c0 [xfs]
xfs_scrub_metadata+0x227/0x4c0 [xfs]
xfs_ioc_scrub_metadata+0x50/0xc0 [xfs]
xfs_file_ioctl+0x90c/0xc40 [xfs]
__x64_sys_ioctl+0x83/0xc0
do_syscall_64+0x3b/0xc0

The unusual handling of errors while initializing struct xchk_ag is the
root cause here. Since the beginning of xfs_scrub, the goal of
xchk_ag_read_headers has been to read all three AG header buffers and
attach them both to the xchk_ag structure and the scrub transaction.
Corruption errors on any of the three headers doesn't necessarily
trigger an immediate return to userspace, because xfs_scrub can also
tell us to /fix/ the problem.

In other words, it's possible for the xchk_ag init functions to return
an error code and a partially filled out structure so that scrub can use
however much information it managed to pull. Before 5.15, it was
sufficient to cancel (or commit) the scrub transaction on the way out of
the scrub code to release the buffers.

Ccommit 48c6615cc557 added a reference to the perag structure to struct
xchk_ag. Since perag structures are not attached to transactions like
buffers are, this adds the requirement that the perag ref be released
explicitly. The scrub teardown function xchk_teardown was amended to do
this for the xchk_ag embedded in struct xfs_scrub.

Unfortunately, I forgot that certain parts of the scrub code probe
multiple AGs and therefore handle the initialization and cleanup on
their own. Specifically, the bmbt scrubber will initialize it long
enough to cross-reference AG metadata for btree blocks and for the
extent mappings in the bmbt.

If one of the AG headers is corrupt, the init function returns with a
live perag structure reference and some of the AG header buffers. If an
error occurs, the cross referencing will be noted as XCORRUPTion and
skipped, but the main scrub process will move on to the next record.
It is now necessary to release the perag reference before we try to
analyze something from a different AG, or else we'll trip over the
assertion noted above.

Fixes: 48c6615cc557 ("xfs: grab active perag ref when reading AG headers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

c5aa903a 20-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: support reading chunk-based uncompressed files

Add runtime support for chunk-based uncompressed files
described in the previous patch.

Link: https://lore.kernel.org/r/20210820100019.208490-2-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

2a9dc7a8 20-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: introduce chunk-based file on-disk format

Currently, uncompressed data except for tail-packing inline is
consecutive on disk.

In order to support chunk-based data deduplication, add a new
corresponding inode data layout.

In the future, the data source of chunks can be either (un)compressed.

Link: https://lore.kernel.org/r/20210820100019.208490-1-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

08d73666 02-Aug-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Remove redundant check from gfs2_glock_dq

In function gfs2_glock_dq, it checks to see if this is the fast path.
Before this patch, it checked both "find_first_holder(gl) == NULL" and
list_empty(&gl->gl_holders), which is redundant. If gl_holders is empty
then find_first_holder must return NULL. This patch removes the
redundancy.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

fffe9bee 30-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Delay withdraw from atomic context

Before this patch, if function __gfs2_ail_flush detected an error
syncing the ail list, it call gfs2_ail_error which called gfs2_withdraw.
Since __gfs2_ail_flush deals with a specific glock, we shouldn't withdraw
immediately because the withdraw code (signal_our_withdraw) uses glocks
in its processing.

This patch changes the call from gfs2_withdraw to gfs2_withdraw_delayed
which defers the withdraw until a more appropriate context, such as the
logd daemon, discovers the intent to withdraw.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>

d1340f80 29-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Don't call dlm after protocol is unmounted

In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call
to lm_unmount. After a withdraw, users are allowed to unmount the
withdrawn file system. But at that point we may still have glocks left
over that we need to free via unmount's call to gfs2_gl_hash_clear.
These glocks may have never been completed because of whatever problem
caused the withdraw (IO errors or whatever).

Before this patch, function gdlm_put_lock would still try to call into
dlm to unlock these leftover glocks, which resulted in dlm returning
-EINVAL because the lock space was abandoned. These glocks were never
freed because there was no mechanism after that to free them.

This patch adds a check to gdlm_put_lock to see if the locking protocol
was inactive (DFL_UNMOUNT flag) and if so, free the glock and not
make the invalid call into dlm.

I could have combined this "if" with the one that follows, related to
leftover glock LVBs, but I felt the code was more readable with its own
if clause.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

8cc67f70 29-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: don't stop reads while withdraw in progress

When gfs2 withdraws a file system, it calls signal_our_withdraw which
triggers another node to replay the withdrawing node's journal. Then it
waits until it knows the journal has been replayed. Part of this wait is
to repeatedly call check_journal_clean which calls gfs2_jdesc_check,
which checks to see if the journal is sane. As part of its sanity checks
it needs to re-read its journal's metadata. But with today's code, any
attempt to re-read the metadata results in -EIO because of a check for
the file system withdraw in function gfs2_meta_wait.

This patch adds an additional check for SDF_WITHDRAW_IN_PROG, to tell
if the read is done while the withdraw is in progress. In that case
we allow the metadata read to not be rejected. Therefore the metadata
check is done properly, so the withdraw sequence can finish normally.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

1b8550b5 29-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Mark journal inodes as "don't cache"

Before this patch, journal inodes were considered regular inodes,
which meant that instead of evicting them, function iput_final would
just put them on the lru for later processing. If the file system
withdrew for whatever reason, the withdraw would never be seen until
the inode was evicted, which could be indefinitely.

This patch marks all journal inodes as "don't cache" which means
function iput_final will evict them immediately, allowing us to
properly recover the journal on other cluster nodes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

ba3ca2bc 29-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: nit: gfs2_drop_inode shouldn't return bool

Today, gfs2_drop_inode can return "false" for an int value.
I'm sure this was just an oversight. Change to int value.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

a8f1d32d 27-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Eliminate vestigial HIF_FIRST

Holder flag HIF_FIRST is no longer used or needed, so remove it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

7392fbb0 25-Jul-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Make recovery error more readable

Before this patch, withdraws could cause an error that looked like:
Journal recovery skipped for 0 until next mount.
This patch changes it to a more readable:
Journal recovery skipped for jid 0 until next mount.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

70c11ba8 30-Jun-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: Don't release and reacquire local statfs bh

Before this patch, several functions in gfs2 related to the updating
of the statfs file used a newly acquired/read buffer_head for the
local statfs file. This is completely unnecessary, because other nodes
should never update it. Recreating the buffer is a waste of time.

This patch allows gfs2 to read in the local statefs buffer_head at
mount time and keep it around until unmount time.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

a28dc123 14-May-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: init system threads before freeze lock

Patch 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro
functions") changed the gfs2 mount sequence so that it holds the freeze
lock before calling gfs2_make_fs_rw. Before this patch, gfs2_make_fs_rw
called init_threads to initialize the quotad and logd threads. That is a
problem if the system needs to withdraw due to IO errors early in the
mount sequence, for example, while initializing the system statfs inode:

1. An IO error causes the statfs glock to not sync properly after
recovery, and leaves items on the ail list.
2. The leftover items on the ail list causes its do_xmote call to fail,
which makes it want to withdraw. But since the glock code cannot
withdraw (because the withdraw sequence uses glocks) it relies upon
the logd daemon to initiate the withdraw.
3. The withdraw can never be performed by the logd daemon because all
this takes place before the logd daemon is started.

This patch moves function init_threads from super.c to ops_fstype.c
and it changes gfs2_fill_super to start its threads before holding the
freeze lock, and if there's an error, stop its threads after releasing
it. This allows the logd to run unblocked by the freeze lock. Thus,
the logd daemon can perform its withdraw sequence properly.

Fixes: 96b1454f2e8e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>

249dbe74 11-Aug-2021 Arnd Bergmann <arnd@arndb.de>

ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation

The epoll_wait() system call wrapper is one of the remaining users of
the set_fs() infrasturcture for Arm. Changing it to not require set_fs()
is rather complex unfortunately.

The approach I'm taking here is to allow architectures to override
the code that copies the output to user space, and let the oabi-compat
implementation check whether it is getting called from an EABI or OABI
system call based on the thread_info->syscall value.

The in_oabi_syscall() check here mirrors the in_compat_syscall() and
in_x32_syscall() helpers for 32-bit compat implementations on other
architectures.

Overall, the amount of code goes down, at least with the newly added
sys_oabi_epoll_pwait() helper getting removed again. The downside
is added complexity in the source code for the native implementation.
There should be no difference in runtime performance except for Arm
kernels with CONFIG_OABI_COMPAT enabled that now have to go through
an external function call to check which of the two variants to use.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

a9a27d4a 20-Aug-2021 Namjae Jeon <namjae.jeon@samsung.com>

ksmbd: don't set FILE DELETE and FILE_DELETE_CHILD in access mask by default

When there is no dacl in request, ksmbd send dacl that coverted by using
file permission. This patch don't set FILE DELETE and FILE_DELETE_CHILD
in access mask by default.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

dc7674ed 29-Jun-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: tiny cleanup in gfs2_log_reserve

Function gfs2_log_reserve was setting revoke_blks to 0. There's no
need because it calculates it shortly thereafter. This patch removes
the unnecessary set.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

69a61144 24-May-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: trivial clean up of gfs2_ail_error

This patch does not change function. It adds variable sdp to clean up
function gfs2_ail_error and make it more readable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

c37453cb 22-Feb-2021 Bob Peterson <rpeterso@redhat.com>

gfs2: be more verbose replaying invalid rgrp blocks

This patch adds some crucial information when journal replay detects a
replay of an obsolete rgrp block. For example, it wasn't printing the
journal id or the generation number played. This just supplements what
is logged in this unusual case.

The function that actually complains about the replaying of an obsolete
rgrp block has been split off to avoid long lines and sparse warnings.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

4c7f65ae 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: rename buffer cache index variable b_bn

To stop external users from using b_bn as the disk address of the
buffer, rename it to b_rhash_key to indicate that it is the buffer
cache index, not the block number of the buffer. Code that needs the
disk address should use xfs_buf_daddr() to obtain it.

Do the rename and clean up any of the remaining internal b_bn users.
Also clean up any remaining b_bn cruft that is now unused.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

9343ee76 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert bp->b_bn references to xfs_buf_daddr()

Stop directly referencing b_bn in code outside the buffer cache, as
b_bn is supposed to be used only as an internal cache index.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

04fcad80 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: introduce xfs_buf_daddr()

Introduce a helper function xfs_buf_daddr() to extract the disk
address of the buffer from the struct xfs_buf. This will replace
direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as
the XFS_BUF_ADDR() macro.

This patch introduces the helper function and replaces all uses of
XFS_BUF_ADDR() as this is just a simple sed replacement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

cf28e17c 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: kill xfs_sb_version_has_v3inode()

All callers to xfs_dinode_good_version() and XFS_DINODE_SIZE() in
both the kernel and userspace have a xfs_mount structure available
which means they can use mount features checks instead looking
directly are the superblock.

Convert these functions to take a mount and use a xfs_has_v3inodes()
check and move it out of the libxfs/xfs_format.h file as it really
doesn't have anything to do with the definition of the on-disk
format.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

d6837c1a 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: introduce xfs_sb_is_v5 helper

Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
checks everywhere, add a simple wrapper to encapsulate this and make
the code easier to read.

This allows us to remove the xfs_sb_version_has_v3inode() wrapper
which is only used in xfs_format.h now and is just a version number
check.

There are a couple of places where we should be checking the mount
feature bits rather than the superblock version (e.g. remount), so
those are converted to use xfs_has_crc(mp) instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

2beb7b50 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: remove unused xfs_sb_version_has wrappers

The vast majority of these wrappers are now unused. Remove them
leaving just the small subset of wrappers that are used to either
add feature bits or make the mount features field setup code
simpler.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

ebd9027d 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert xfs_sb_version_has checks to use mount features

This is a conversion of the remaining xfs_sb_version_has..(sbp)
checks to use xfs_has_..(mp) feature checks.

This was largely done with a vim replacement macro that did:

:0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR>

A couple of other variants were also used, and the rest touched up
by hand.

$ size -t fs/xfs/built-in.a
text data bss dec hex filename
before 1127533 311352 484 1439369 15f689 (TOTALS)
after 1125360 311352 484 1437196 15ee0c (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

55fafb31 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert scrub to use mount-based feature checks

The scrub feature checks are the last place that the superblock
feature checks are used. Convert them to mount based feature checks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

fe08cc50 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: open code sb verifier feature checks

The superblock verifiers are one of the last places that use the sb
version functions to do feature checks. This are all quite simple
uses, and there aren't many of them so open code them all.

Also, move the good version number check into xfs_sb.c instead of it
being an inline function in xfs_format.h

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

03288b19 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert xfs_fs_geometry to use mount feature checks

Reporting filesystem features to userspace is currently superblock
based. Now we have a general mount-based feature infrastructure,
switch to using the xfs_mount rather than the superblock directly.

This reduces the size of the function by over 300 bytes.

$ size -t fs/xfs/built-in.a
text data bss dec hex filename
before 1127855 311352 484 1439691 15f7cb (TOTALS)
after 1127535 311352 484 1439371 15f68b (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

75c8c50f 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown

Remove the shouty macro and instead use the inline function that
matches other state/feature check wrapper naming. This conversion
was done with sed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

2e973b2c 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert remaining mount flags to state flags

The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.

Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

0560f31a 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert mount flags to features

Replace m_flags feature checks with xfs_has_<feature>() calls and
rework the setup code to set flags in m_features.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

8970a5b8 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: consolidate mount option features in m_features

This provides separation of mount time feature flags from runtime
mount flags and mount option state. It also makes the feature
checks use the same interface as the superblock features. i.e. we
don't care if the feature is enabled by superblock flags or mount
options, we just care if it's enabled or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

38c26bfd 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like
this:

for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done

With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.

The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:

$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


xfs/libxfs/xfs_ag.c
xfs/libxfs/xfs_alloc.c
xfs/libxfs/xfs_alloc_btree.c
xfs/libxfs/xfs_alloc_btree.h
xfs/libxfs/xfs_attr_leaf.c
xfs/libxfs/xfs_attr_remote.c
xfs/libxfs/xfs_bmap.c
xfs/libxfs/xfs_bmap_btree.c
xfs/libxfs/xfs_bmap_btree.h
xfs/libxfs/xfs_btree.c
xfs/libxfs/xfs_da_btree.c
xfs/libxfs/xfs_da_format.h
xfs/libxfs/xfs_dir2_block.c
xfs/libxfs/xfs_dir2_data.c
xfs/libxfs/xfs_dir2_leaf.c
xfs/libxfs/xfs_dir2_node.c
xfs/libxfs/xfs_dir2_sf.c
xfs/libxfs/xfs_dquot_buf.c
xfs/libxfs/xfs_format.h
xfs/libxfs/xfs_ialloc.c
xfs/libxfs/xfs_ialloc_btree.c
xfs/libxfs/xfs_ialloc_btree.h
xfs/libxfs/xfs_inode_buf.c
xfs/libxfs/xfs_log_format.h
xfs/libxfs/xfs_log_rlimit.c
xfs/libxfs/xfs_refcount_btree.c
xfs/libxfs/xfs_rmap.c
xfs/libxfs/xfs_rmap_btree.c
xfs/libxfs/xfs_sb.c
xfs/libxfs/xfs_symlink_remote.c
xfs/libxfs/xfs_trans_resv.c
xfs/libxfs/xfs_types.c
xfs/scrub/agheader.c
xfs/scrub/agheader_repair.c
xfs/scrub/attr.c
xfs/scrub/bmap.c
xfs/scrub/common.c
xfs/scrub/dabtree.c
xfs/scrub/dir.c
xfs/scrub/ialloc.c
xfs/scrub/inode.c
xfs/scrub/repair.c
xfs/scrub/scrub.c
xfs/xfs_bmap_util.c
xfs/xfs_buf.c
xfs/xfs_buf_item.c
xfs/xfs_buf_item_recover.c
xfs/xfs_dir2_readdir.c
xfs/xfs_dquot.c
xfs/xfs_dquot_item_recover.c
xfs/xfs_file.c
xfs/xfs_fsmap.c
xfs/xfs_fsops.c
xfs/xfs_icache.c
xfs/xfs_inode.c
xfs/xfs_inode_item.c
xfs/xfs_inode_item_recover.c
xfs/xfs_ioctl.c
xfs/xfs_iops.c
xfs/xfs_itable.c
xfs/xfs_log.c
xfs/xfs_log_recover.c
xfs/xfs_mount.c
xfs/xfs_mount.h
xfs/xfs_qm.c
xfs/xfs_qm_bhv.c
xfs/xfs_qm_syscalls.c
xfs/xfs_refcount_item.c
xfs/xfs_reflink.c
xfs/xfs_rmap_item.c
xfs/xfs_rtalloc.c
xfs/xfs_super.c
xfs/xfs_symlink.c
xfs/xfs_trans.c
xfs/xfs_trans_dquot.c
a1d86e8d 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: reflect sb features in xfs_mount

Currently on-disk feature checks require decoding the superblock
fileds and so can be non-trivial. We have almost 400 hundred
individual feature checks in the XFS code, so this is a significant
amount of code. To reduce runtime check overhead, pre-process all
the version flags into a features field in the xfs_mount at mount
time so we can convert all the feature checks to a simple flag
check.

There is also a need to convert the dynamic feature flags to update
the m_features field. This is required for attr, attr2 and quota
features. New xfs_mount based wrappers are added for this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

e23b55d5 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: rework attr2 feature and mount options

The attr2 feature is somewhat unique in that it has both a superblock
feature bit to enable it and mount options to enable and disable it.

Back when it was first introduced in 2005, attr2 was disabled unless
either the attr2 superblock feature bit was set, or the attr2 mount
option was set. If the superblock feature bit was not set but the
mount option was set, then when the first attr2 format inode fork
was created, it would set the superblock feature bit. This is as it
should be - the superblock feature bit indicated the presence of the
attr2 on disk format.

The noattr2 mount option, however, did not affect the superblock
feature bit. If noattr2 was specified, the on-disk superblock
feature bit was ignored and the code always just created attr1
format inode forks. If neither of the attr2 or noattr2 mounts
option were specified, then the behaviour was determined by the
superblock feature bit.

This was all pretty sane.

Fast foward 3 years, and we are dealing with fallout from the
botched sb_features2 addition and having to deal with feature
mismatches between the sb_features2 and sb_bad_features2 fields. The
attr2 feature bit was one of these flags. The reconciliation was
done well after mount option parsing and, unfortunately, the feature
reconciliation had a bug where it ignored the noattr2 mount option.

For reasons lost to the mists of time, it was decided that resolving
this issue in commit 7c12f296500e ("[XFS] Fix up noattr2 so that it
will properly update the versionnum and features2 fields.") required
noattr2 to clear the superblock attr2 feature bit. This greatly
complicated the attr2 behaviour and broke rules about feature bits
needing to be set when those specific features are present in the
filesystem.

By complicated, I mean that it introduced problems due to feature
bit interactions with log recovery. All of the superblock feature
bit checks are done prior to log recovery, but if we crash after
removing a feature bit, then on the next mount we see the feature
bit in the unrecovered superblock, only to have it go away after the
log has been replayed. This means our mount time feature processing
could be all wrong.

Hence you can mount with noattr2, crash shortly afterwards, and
mount again without attr2 or noattr2 and still have attr2 enabled
because the second mount sees attr2 still enabled in the superblock
before recovery runs and removes the feature bit. It's just a mess.

Further, this is all legacy code as the v5 format requires attr2 to
be enabled at all times and it cannot be disabled. i.e. the noattr2
mount option returns an error when used on v5 format filesystems.

To straighten this all out, this patch reverts the attr2/noattr2
mount option behaviour back to the original behaviour. There is no
reason for disabling attr2 these days, so we will only do this when
the noattr2 mount option is set. This will not remove the superblock
feature bit. The superblock bit will provide the default behaviour
and only track whether attr2 is present on disk or not. The attr2
mount option will enable the creation of attr2 format inode forks,
and if the superblock feature bit is not set it will be added when
the first attr2 inode fork is created.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

51b495eb 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: rename xfs_has_attr()

xfs_has_attr() is poorly named. It has global scope as it is defined
in a header file, but it has no namespace scope that tells us what
it is checking has attributes. It's not even clear what "has_attr"
means, because what it is actually doing is an attribute fork lookup
to see if the attribute exists.

Upcoming patches use this "xfs_has_<foo>" namespace for global
filesystem features, which conflicts with this function.

Rename xfs_has_attr() to xfs_attr_lookup() and make it a static
function, freeing up the "xfs_has_" namespace for global scope
usage.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

8cf07f3d 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: sb verifier doesn't handle uncached sb buffer

The verifier checks explicitly for bp->b_bn == XFS_SB_DADDR to match
the primary superblock buffer, but the primary superblock is an
uncached buffer and so bp->b_bn is always -1ULL. Hence this never
matches and the CRC error reporting is wholly dependent on the
mount superblock already being populated so CRC feature checks pass
and allow CRC errors to be reported.

Fix this so that the primary superblock CRC error reporting is not
dependent on already having read the superblock into memory.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

e5f2e54a 18-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: start documenting common units and tags used in tracepoints

Because there are a lot of tracepoints that express numeric data with
an associated unit and tag, document what they are to help everyone else
keep these thigns straight.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

c03e4b9e 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: decode scrub flags in ftrace output

When using pretty-printed scrub tracepoints, decode the meaning of the
scrub flags as strings for easier reading.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

b641851c 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize inode generation formatting in ftrace output

Always print inode generation in hexadecimal and preceded with the unit
"gen".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

7eac3029 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize remaining xfs_buf length tracepoints

For the remaining xfs_buf tracepoints, convert all the tags to
xfs_daddr_t units and retag them 'daddrcount' to match everything else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

f93f85f7 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: resolve fork names in trace output

Emit whichfork values as text strings in the ftrace output.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

c23460eb 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: rename i_disk_size fields in ftrace output

Whenever we record i_disk_size (i.e. the ondisk file size), use the
"disize" tag and hexadecimal format consistently.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

d538cf24 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: disambiguate units for ftrace fields tagged "count"

Some of our tracepoints have a field known as "count". That name
doesn't describe any units, which makes the fields not very useful.
Rename the fields to capture units and ensure the format is hexadecimal
when we're referring to blocks, extents, or IO operations.

"fsbcount" are in units of fs blocks
"bytecount" are in units of bytes

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

7989accc 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: disambiguate units for ftrace fields tagged "len"

Some of our tracepoints have a field known as "len". That name doesn't
describe any units, which makes the fields not very useful. Rename the
fields to capture units and ensure the format is hexadecimal.

"fsbcount" are in units of fs blocks
"bbcount" are in units of 512b blocks
"ireccount" are in units of inodes

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

49e68c91 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: disambiguate units for ftrace fields tagged "offset"

Some of our tracepoints describe fields as "offset". That name doesn't
describe any units, which makes the fields not very useful. Rename the
fields to capture units and ensure the format is hexadecimal.

"fileoff" means file offset, in units of fs blocks
"pos" means file offset, in bytes
"forkoff" means inode fork offset, in bytes

The one remaining "offset" value is for iclogs, since that's the byte
offset of the end of where we've written into the current iclog.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

6f25b211 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: disambiguate units for ftrace fields tagged "blkno", "block", or "bno"

Some of our tracepoints describe fields as "blkno", "block", or "bno".
That name doesn't describe any units, which makes the fields not very
useful. Rename the fields to capture units and ensure the format is
hexadecimal.

"startblock" is the startblock field from the bmap structure, which is a
segmented fsblock on the data device, or an rfsblock on the realtime
device.
"fileoff" is a file offset, in units of filesystem blocks
"daddr" is a raw device offset, in 512b blocks

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

92eff386 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize daddr formatting in ftrace output

Always print disk addr (i.e. 512 byte block) numbers in hexadecimal and
preceded with the unit "daddr".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

97f4f915 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize rmap owner number formatting in ftrace output

Always print rmap owner number in hexadecimal and preceded with the unit
"owner".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

f7b08163 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize AG block number formatting in ftrace output

Always print allocation group block numbers in hexadecimal and preceded
with the unit "agbno".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

9febf39d 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize AG number formatting in ftrace output

Always print allocation group numbers in hexadecimal and preceded with
the unit "agno".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

af6265a0 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: standardize inode number formatting in ftrace output

Always print inode numbers in hexadecimal and preceded with the unit
"ino" or "agino", as apropriate. Fix one tracepoint that used "ino %u"
for an inode btree block count to reduce confusion.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

3fd7cb84 17-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: fix incorrect unit conversion in scrub tracepoint

XFS_DADDR_TO_FSB converts a raw disk address (in units of 512b blocks)
to a raw disk address (in units of fs blocks). Unfortunately, the
xchk_block_error_class tracepoints incorrectly uses this to decode
xfs_daddr_t into segmented AG number and AG block addresses. Use the
correct translation code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

aee742c9 18-Aug-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: fix return -EINTR on recovery stopped

This patch will return -EINTR instead of 1 if recovery is stopped. In
case of ping_members() the return value will be checked if the error is
-EINTR for signaling another recovery was triggered and the whole
recovery process will come to a clean end to process the next one.
Returning 1 will abort the recovery process and can leave the recovery
in a broken state.

It was reported with the following kernel log message attached and a gfs2
mount stopped working:

"dlm: bobvirt1: dlm_recover_members error 1"

whereas 1 was returned because of a conversion of "dlm_recovery_stopped()"
to an errno was missing which this patch will introduce. While on it all
other possible missing errno conversions at other places were added as
they are done as in other places.

It might be worth to check the error case at this recovery level,
because some of the functionality also returns -ENOBUFS and check why
recovery ends in a broken state. However this will fix the issue if
another recovery was triggered at some points of recovery handling.

Reported-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>

b97f8525 18-Aug-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: implement delayed ack handling

This patch changes that we don't ack each message. Lowcomms will take
care about to send an ack back after a bulk of messages was processed.
Currently it's only when the whole receive buffer was processed, there
might better positions to send an ack back but only the lowcomms
implementation know when there are more data to receive. This patch has
also disadvantages that we might retransmit more on errors, however this
is a very rare case.

Tested with make_panic on gfs2 with three nodes by running:

trace-cmd record -p function -l 'dlm_send_ack' sleep 100

and

trace-cmd report | wc -l

Before patch:
- 20548
- 21376
- 21398

After patch:
- 18338
- 20679
- 19949

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>

a437b9b4 13-Aug-2021 Christoph Hellwig <hch@lst.de>

xfs: remove support for untagged lookups in xfs_icwalk*

With quotaoff not allowing disabling of accounting there is no need
for untagged lookups in this code, so remove the dead leftovers.

Repoted-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: convert to for_each_perag_tag]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

32816fd7 12-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: constify btree function parameters that are not modified

Constify the rest of the btree functions that take structure and union
pointers and are not supposed to modify them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

60e265f7 12-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the start pointer passed to btree update_lastrec functions const

This btree function is called when updating a record in the rightmost
block of a btree so that we can update the AGF's longest free extent
length field. Neither parameter is supposed to be updated, so mark them
both const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

deb06b9a 12-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the start pointer passed to btree alloc_block functions const

The @start pointer passed to each per-AG btree type's ->alloc_block
function isn't supposed to be modified, since it's a hint about the
location of the btree block being split that is to be fed to the
allocator, so mark the parameter const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

b5a6e5fe 12-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the pointer passed to btree set_root functions const

The pointer passed to each per-AG btree type's ->set_root function isn't
supposed to be modified (that function sets an external pointer to the
root block) so mark them const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

22ece4e8 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: mark the record passed into xchk_btree functions as const

xchk_btree calls a user-supplied function to validate each btree record
that it finds. Those functions are not supposed to change the record
data, so mark the parameter const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

8e38dc88 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the keys and records passed to btree inorder functions const

The inorder functions are simple predicates, which means that they don't
modify the parameters. Mark them all const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

23825cd1 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: mark the record passed into btree init_key functions as const

These functions initialize a key from a record, but they aren't supposed
to modify the record. Mark it const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

159eb69d 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the record pointer passed to query_range functions const

The query_range functions are supposed to call a caller-supplied
function on each record found in the dataset. These functions don't
own the memory storing the record, so don't let them change the record.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

04dcb474 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the key parameters to all btree query range functions const

Range query functions are not supposed to modify the query keys that are
being passed in, so mark them all const.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

d29d5577 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make the key parameters to all btree key comparison functions const

The btree key comparison functions are not allowed to change the keys
that are passed in, so mark them const. We'll need this for the next
patch, which adds const to the btree range query functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

7f89c838 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: add trace point for fs shutdown

Add a tracepoint for fs shutdowns so we can capture that in ftrace
output.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

54406764 09-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: remove unnecessary agno variable from struct xchk_ag

Now that we always grab an active reference to a perag structure when
dealing with perag metadata, we can remove this unnecessary variable.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

7e1826e0 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make fsmap backend function key parameters const

There are several GETFSMAP backend functions for XFS to cover the three
devices and various feature support. Each of these functions are passed
pointers to the low and high keys for the dataset that userspace
requested, and a pointer to scratchpad variables that are used to
control the iteration and fill out records. The scratchpad data can be
changed arbitrarily, but the keys are supposed to remain unchanged (and
under the control of the outermost loop in xfs_getfsmap).

Unfortunately, the data and rt backends modify the keys that are passed
in from the main control loop, which causes subsequent calls to return
incorrect query results. Specifically, each of those two functions set
the block number in the high key to the size of their respective device.
Since fsmap results are sorted in device number order, if the lower
numbered device is smaller than the higher numbered device, the first
function will set the high key to the small size, and the key remains
unchanged as it is passed into the function for the higher numbered
device. The second function will then fail to return all of the results
for the dataset that userspace is asking for because the keyspace is
incorrectly constrained.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

9ab72f22 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: fix off-by-one error when the last rt extent is in use

The fsmap implementation for realtime devices uses the gap between
info->next_daddr and a free rtextent reported by xfs_rtalloc_query_range
to feed userspace fsmap records with an "unknown" owner. We use this
trick to report to userspace when the last rtextent in the filesystem is
in use by synthesizing a null rmap record starting at the next block
after the query range.

Unfortunately, there's a minor accounting bug in the way that we
construct the null rmap record. Originally, ahigh.ar_startext contains
the last rtextent for which the user wants records. It's entirely
possible that number is beyond the end of the rt volume, so the location
synthesized rmap record /must/ be constrained to the minimum of the high
key and the number of extents in the rt volume.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

c02f6529 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: make xfs_rtalloc_query_range input parameters const

In commit 8ad560d2565e, we changed xfs_rtalloc_query_range to constrain
the range of bits in the realtime bitmap file that would actually be
searched. In commit a3a374bf1889, we changed the range again
(incorrectly), leading to the fix in commit d88850bd5516, which finally
corrected the range check code. Unfortunately, the author never noticed
that the function modifies its input parameters, which is a totaly no-no
since none of the other range query functions change their input
parameters.

So, fix this function yet again to stash the upper end of the query
range (i.e. the high key) in a local variable and hope this is the last
time I have to fix my own function. While we're at it, mark the key
inputs const so nobody makes this mistake again. :(

Fixes: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
Not-fixed-by: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
Not-fixed-by: d88850bd5516 ("xfs: fix high key handling in the rt allocator's query_range function")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

21b4ee70 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: drop ->writepage completely

->writepage is only used in one place - single page writeback from
memory reclaim. We only allow such writeback from kswapd, not from
direct memory reclaim, and so it is rarely used. When it comes from
kswapd, it is effectively random dirty page shoot-down, which is
horrible for IO patterns. We will already have background writeback
trying to clean all the dirty pages in memory as efficiently as
possible, so having kswapd interrupt our well formed IO stream only
slows things down. So get rid of xfs_vm_writepage() completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: forward port to 5.15]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

c0891ac1 02-Aug-2021 Alexey Dobriyan <adobriyan@gmail.com>

isystem: ship and use stdarg.h

Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.

GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

332f606b 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: enable RCU'd ->get_acl()

Overlayfs does not cache ACL's (to avoid double caching). Instead it just
calls the underlying filesystem's i_op->get_acl(), which will return the
cached value, if possible.

In rcu path walk, however, get_cached_acl_rcu() is employed to get the
value from the cache, which will fail on overlayfs resulting in dropping
out of rcu walk mode. This can result in a big performance hit in certain
situations.

Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which
indicates pass-through)

Reported-by: garyhuang <zjh.20052005@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

0cad6246 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: add rcu argument to ->get_acl() callback

Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

36ca7943 18-Aug-2021 Xu Yu <xuyu@linux.alibaba.com>

mm/swap: consider max pages in iomap_swapfile_add_extent

When the max pages (last_page in the swap header + 1) is smaller than
the total pages (inode size) of the swapfile, iomap_swapfile_activate
overwrites sis->max with total pages.

However, frontswap_map is a swap page state bitmap allocated using the
initial sis->max page count read from the swap header. If swapfile
activation increases sis->max, it's possible for the frontswap code to
walk off the end of the bitmap, thereby corrupting kernel memory.

[djwong: modify the description a bit; the original paragraph reads:

"However, frontswap_map is allocated using max pages. When test and clear
the sis offset, which is larger than max pages, of frontswap_map in
__frontswap_invalidate_page(), neighbors of frontswap_map may be
overwritten, i.e., slab is polluted."

Note also that this bug resulted in a behavioral change: activating a
swap file that was formatted and later extended results in all pages
being activated, not the number of pages recorded in the swap header.]

This fixes the issue by considering the limitation of max pages of swap
info in iomap_swapfile_add_extent().

To reproduce the case, compile kernel with slub RED ZONE, then run test:
$ sudo stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times \
--verify -v -Y /root/tmpdir/stress-ng/stress-statistic-12.yaml \
--log-file /root/tmpdir/stress-ng/stress-logfile-12.txt \
--temp-path /root/tmpdir/stress-ng/

We'll get the error log as below:

[ 1151.015141] =============================================================================
[ 1151.016489] BUG kmalloc-16 (Not tainted): Right Redzone overwritten
[ 1151.017486] -----------------------------------------------------------------------------
[ 1151.017486]
[ 1151.018997] Disabling lock debugging due to kernel taint
[ 1151.019873] INFO: 0x0000000084e43932-0x0000000098d17cae @offset=7392. First byte 0x0 instead of 0xcc
[ 1151.021303] INFO: Allocated in __do_sys_swapon+0xcf6/0x1170 age=43417 cpu=9 pid=3816
[ 1151.022538] __slab_alloc+0xe/0x20
[ 1151.023069] __kmalloc_node+0xfd/0x4b0
[ 1151.023704] __do_sys_swapon+0xcf6/0x1170
[ 1151.024346] do_syscall_64+0x33/0x40
[ 1151.024925] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.025749] INFO: Freed in put_cred_rcu+0xa1/0xc0 age=43424 cpu=3 pid=2041
[ 1151.026889] kfree+0x276/0x2b0
[ 1151.027405] put_cred_rcu+0xa1/0xc0
[ 1151.027949] rcu_do_batch+0x17d/0x410
[ 1151.028566] rcu_core+0x14e/0x2b0
[ 1151.029084] __do_softirq+0x101/0x29e
[ 1151.029645] asm_call_irq_on_stack+0x12/0x20
[ 1151.030381] do_softirq_own_stack+0x37/0x40
[ 1151.031037] do_softirq.part.15+0x2b/0x30
[ 1151.031710] __local_bh_enable_ip+0x4b/0x50
[ 1151.032412] copy_fpstate_to_sigframe+0x111/0x360
[ 1151.033197] __setup_rt_frame+0xce/0x480
[ 1151.033809] arch_do_signal+0x1a3/0x250
[ 1151.034463] exit_to_user_mode_prepare+0xcf/0x110
[ 1151.035242] syscall_exit_to_user_mode+0x27/0x190
[ 1151.035970] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.036795] INFO: Slab 0x000000003b9de4dc objects=44 used=9 fp=0x00000000539e349e flags=0xfffffc0010201
[ 1151.038323] INFO: Object 0x000000004855ba01 @offset=7376 fp=0x0000000000000000
[ 1151.038323]
[ 1151.039683] Redzone 000000008d0afd3d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................
[ 1151.041180] Object 000000004855ba01: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1151.042714] Redzone 0000000084e43932: 00 00 00 c0 cc cc cc cc ........
[ 1151.044120] Padding 000000000864c042: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[ 1151.045615] CPU: 5 PID: 3816 Comm: stress-ng Tainted: G B 5.10.50+ #7
[ 1151.046846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 1151.048633] Call Trace:
[ 1151.049072] dump_stack+0x57/0x6a
[ 1151.049585] check_bytes_and_report+0xed/0x110
[ 1151.050320] check_object+0x1eb/0x290
[ 1151.050924] ? __x64_sys_swapoff+0x39a/0x540
[ 1151.051646] free_debug_processing+0x151/0x350
[ 1151.052333] __slab_free+0x21a/0x3a0
[ 1151.052938] ? _cond_resched+0x2d/0x40
[ 1151.053529] ? __vunmap+0x1de/0x220
[ 1151.054139] ? __x64_sys_swapoff+0x39a/0x540
[ 1151.054796] ? kfree+0x276/0x2b0
[ 1151.055307] kfree+0x276/0x2b0
[ 1151.055832] __x64_sys_swapoff+0x39a/0x540
[ 1151.056466] do_syscall_64+0x33/0x40
[ 1151.057084] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1151.057866] RIP: 0033:0x150340b0ffb7
[ 1151.058481] Code: Unable to access opcode bytes at RIP 0x150340b0ff8d.
[ 1151.059537] RSP: 002b:00007fff7f4ee238 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
[ 1151.060768] RAX: ffffffffffffffda RBX: 00007fff7f4ee66c RCX: 0000150340b0ffb7
[ 1151.061904] RDX: 000000000000000a RSI: 0000000000018094 RDI: 00007fff7f4ee860
[ 1151.063033] RBP: 00007fff7f4ef980 R08: 0000000000000000 R09: 0000150340a672bd
[ 1151.064135] R10: 00007fff7f4edca0 R11: 0000000000000246 R12: 0000000000018094
[ 1151.065253] R13: 0000000000000005 R14: 000000000160d930 R15: 00007fff7f4ee66c
[ 1151.066413] FIX kmalloc-16: Restoring 0x0000000084e43932-0x0000000098d17cae=0xcc
[ 1151.066413]
[ 1151.067890] FIX kmalloc-16: Object at 0x000000004855ba01 not freed

Fixes: 67482129cdab ("iomap: add a swapfile activation function")
Fixes: a45c0eccc564 ("iomap: move the swapfile code into a separate file")
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

d6d09a69 18-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"One more fix for cross-rename, adding a missing check for directory
and subvolume, this could lead to a crash"

* tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent rename2 from exchanging a subvol with a directory from different parents


3b844826 05-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

pipe: avoid unnecessary EPOLLET wakeups under normal loads

I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.

Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.

It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.

I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.

Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.

[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]

Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eadcd6b5 12-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: add fiemap support with iomap

This adds fiemap support for both uncompressed files and compressed
files by using iomap infrastructure.

Link: https://lore.kernel.org/r/20210813052931.203280-3-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

d95ae5e2 18-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: add support for the full decompressed length

Previously, there is no need to get the full decompressed length since
EROFS supports partial decompression. However for some other cases
such as fiemap, the full decompressed length is necessary for iomap to
make it work properly.

This patch adds a way to get the full decompressed length. Note that
it takes more metadata overhead and it'd be avoided if possible in the
performance sensitive scenario.

Link: https://lore.kernel.org/r/20210818152231.243691-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

202bc942 10-Aug-2021 Damien Le Moal <damien.lemoal@wdc.com>

block: Introduce IOPRIO_NR_LEVELS

The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.

Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9cb0073b 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: pin ctx on fallback execution

Pin ring in io_fallback_req_func() by briefly elevating ctx->refs in
case any task_work handler touches ctx after releasing a request.

Fixes: 9011bf9a13e3b ("io_uring: fix stuck fallback reqs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/833a494713d235ec144284a9bbfe418df4f6b61c.1629235576.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

76224355 17-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: truncate pagecache on atomic_o_trunc

fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().

Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.

Reported-by: Xie Yongji <xieyongji@bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

bbe1da7e 05-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: compress: do sanity check on cluster

This patch adds f2fs_sanity_check_cluster() to support doing
sanity check on cluster of compressed file, it will be triggered
from below two paths:

- __f2fs_cluster_blocks()
- f2fs_map_blocks(F2FS_GET_BLOCK_FIEMAP)

And it can detect below three kind of cluster insanity status.

C: COMPRESS_ADDR
N: NULL_ADDR or NEW_ADDR
V: valid blkaddr
*: any value

1. [*|C|*|*]
2. [C|*|C|*]
3. [C|N|N|V]

Signed-off-by: Chao Yu <chao@kernel.org>
[Nathan Chancellor: fix missing inline warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

491f7f71 14-Aug-2021 Yangtao Li <frank.li@vivo.com>

f2fs: convert S_IRUGO to 0444

To fix:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b96d9b3b 05-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix to keep compatibility of fault injection interface

The value of FAULT_* macros and its description in f2fs.rst became
inconsistent, fix this to keep compatibility of fault injection
interface.

Fixes: 67883ade7a98 ("f2fs: remove FAULT_ALLOC_BIO")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

32410577 08-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: support fault injection for f2fs_kmem_cache_alloc()

This patch supports to inject fault into f2fs_kmem_cache_alloc().

Usage:
a) echo 32768 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=32768 <dev> <mountpoint>

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

4a4fc043 08-Aug-2021 Fengnan Chang <changfengnan@vivo.com>

f2fs: compress: allow write compress released file after truncate to zero

For compressed file, after release compress blocks, don't allow write
direct, but we should allow write direct after truncate to zero.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

f7104cc1 12-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfsd4: Fix forced-expiry locking

This should use the network-namespace-wide client_lock, not the
per-client cl_lock.

You shouldn't see any bugs unless you're actually using the
forced-expiry interface introduced by 89c905beccbb.

Fixes: 89c905beccbb "nfsd: allow forced expiration of NFSv4 clients"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d02a3a2c 02-Aug-2021 Jia He <hejianet@gmail.com>

lockd: change the proc_handler for nsm_use_hostnames

nsm_use_hostnames is a module parameter and it will be exported to sysctl
procfs. This is to let user sometimes change it from userspace. But the
minimal unit for sysctl procfs read/write it sizeof(int).
In big endian system, the converting from/to bool to/from int will cause
error for proc items.

This patch use a new proc_handler proc_dobool to fix it.

Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
[thuth: Fix typo in commit message]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ea49dc79 27-Jul-2021 NeilBrown <neilb@suse.de>

NFSD: remove vanity comments

Including one's name in copyright claims is appropriate. Including it
in random comments is just vanity. After 2 decades, it is time for
these to be gone.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

cd2d644d 26-Jul-2021 Benjamin Coddington <bcodding@redhat.com>

lockd: Fix invalid lockowner cast after vfs_test_lock

After calling vfs_test_lock() the pointer to a conflicting lock can be
returned, and that lock is not guarunteed to be owned by nlm. In that
case, we cannot cast it to struct nlm_lockowner. Instead return the pid
of that conflicting lock.

Fixes: 646d73e91b42 ("lockd: Show pid of lockd for remote locks")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d27b74a8 14-May-2021 Chuck Lever <chuck.lever@oracle.com>

NFSD: Use new __string_len C macros for nfsd_clid_class

Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

408c0de7 12-May-2021 Chuck Lever <chuck.lever@oracle.com>

NFSD: Use new __string_len C macros for the nfs_dirent tracepoint

Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

496d83cf 28-Jun-2021 Chuck Lever <chuck.lever@oracle.com>

NFSD: Batch release pages during splice read

Large splice reads call put_page() repeatedly. put_page() is
relatively expensive to call, so replace it with the new
svc_rqst_replace_page() helper to help amortize that cost.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>

c7e0b781 28-Jun-2021 Chuck Lever <chuck.lever@oracle.com>

NFSD: Clean up splice actor

A few useful observations:

- The value in @size is never modified.

- splice_desc.len is an unsigned int, and so is xdr_buf.page_len.
An implicit cast to size_t is unnecessary.

- The computation of .page_len is the same in all three arms
of the "if" statement, so hoist it out to make it clear that
the operation is an unconditional invariant.

The resulting function is 18 bytes shorter on my system (-Os).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>

52d5a0c6 16-Aug-2021 chenying <chenying.kernel@bytedance.com>

ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()

If function ovl_instantiate() returns an error, ovl_cleanup will be called
and try to remove newdentry from wdir, but the newdentry has been moved to
udir at this time. This will causes BUG_ON(victim->d_parent->d_inode !=
dir) in fs/namei.c:may_delete.

Signed-off-by: chenying <chenying.kernel@bytedance.com>
Fixes: 01b39dcc9568 ("ovl: use inode_insert5() to hash a newly created inode")
Link: https://lore.kernel.org/linux-unionfs/e6496a94-a161-dc04-c38a-d2544633acb4@bytedance.com/
Cc: <stable@vger.kernel.org> # v4.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

f945ca19 22-Jul-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: use kvalloc in xattr copy-up

Extended attributes are usually small, but could be up to 64k in size, so
use the most efficient method for doing the allocation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

d8991e86 09-Mar-2021 Chengguang Xu <cgxu519@mykernel.net>

ovl: update ctime when changing fileattr

Currently we keep size, mode and times of overlay inode
as the same as upper inode, so should update ctime when
changing file attribution as well.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

b71759ef 24-Apr-2021 Chengguang Xu <cgxu519@mykernel.net>

ovl: skip checking lower file's i_writecount on truncate

It is possible that a directory tree is shared between multiple overlay
instances as a lower layer. In this case when one instance executes a file
residing on the lower layer, the other instance denies a truncate(2) call
on this file.

This only happens for truncate(2) and not for open(2) with the O_TRUNC
flag.

Fix this interference and inconsistency by removing the preliminary
i_writecount check before copy-up.

This means that unlike on normal filesystems truncate(argv[0]) will now
succeed. If this ever causes a regression in a real world use case this
needs to be revisited.

One way to fix this properly would be to keep a correct i_writecount in the
overlay inode, but that is difficult due to memory mapping code only
dealing with the real file/inode.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

ffb24e3c 27-Apr-2021 Amir Goldstein <amir73il@gmail.com>

ovl: relax lookup error on mismatch origin ftype

We get occasional reports of lookup errors due to mismatched
origin ftype from users that re-format a lower squashfs image.

Commit 13c6ad0f45fd ("ovl: document lower modification caveats")
tries to discourage the practice of re-formating lower layers and
describes the expected behavior as undefined.

Commit b0e0f69731cd ("ovl: restrict lower null uuid for "xino=auto"")
limits the configurations in which origin file handles are followed.

In addition to these measures, change the behavior in case of detecting
a mismatch origin ftype in lookup to issue a warning, not follow origin,
but not fail the lookup operation either.

That should make overall more users happy without any big consequences.

Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxgPq9E9xxwU2CDyHy-_yCZZeymg+3n+-6AqkGGE1YtwvQ@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

1fc31aac 27-May-2021 Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>

ovl: do not set overlay.opaque for new directories

Enable optimizations only if user opted-in for any of extended features.
If optimization is enabled, it breaks existing use case when a lower layer
directory appears after directory was created on a merged layer. If
overlay.opaque is applied, new files on lower layer are not visible.

Consider the following scenario:
- /lower and /upper are mounted to /merged
- directory /merged/new-dir is created with a file test1
- overlay is unmounted
- directory /lower/new-dir is created with a file test2
- overlay is mounted again

If opaque is applied by default, file test2 is not going to be visible
without explicitly clearing the overlay.opaque attribute

Signed-off-by: Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

ca45275c 27-May-2021 Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>

ovl: add ovl_allow_offline_changes() helper

Allows to check whether any of extended features are enabled

Signed-off-by: Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

e4522bc8 27-May-2021 Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>

ovl: disable decoding null uuid with redirect_dir

Currently decoding origin with lower null uuid is not allowed unless user
opted-in to one of the new features that require following the lower inode
of non-dir upper (index, xino, metacopy). Now we add redirect_dir too to
that feature list.

Signed-off-by: Vyacheslav Yurkov <Vyacheslav.Yurkov@bruker.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

096a218a 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: consistent behavior for immutable/append-only inodes

When a lower file has immutable/append-only fileattr flags, the behavior of
overlayfs post copy up is inconsistent.

Immediattely after copy up, ovl inode still has the S_IMMUTABLE/S_APPEND
inode flags copied from lower inode, so vfs code still treats the ovl inode
as immutable/append-only. After ovl inode evict or mount cycle, the ovl
inode does not have these inode flags anymore.

We cannot copy up the immutable and append-only fileattr flags, because
immutable/append-only inodes cannot be linked and because overlayfs will
not be able to set overlay.* xattr on the upper inodes.

Instead, if any of the fileattr flags of interest exist on the lower inode,
we store them in overlay.protattr xattr on the upper inode and we read the
flags from xattr on lookup and on fileattr_get().

This gives consistent behavior post copy up regardless of inode eviction
from cache.

When user sets new fileattr flags, we update or remove the overlay.protattr
xattr.

Storing immutable/append-only fileattr flags in an xattr instead of upper
fileattr also solves other non-standard behavior issues - overlayfs can now
copy up children of "ovl-immutable" directories and lower aliases of
"ovl-immutable" hardlinks.

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Link: https://lore.kernel.org/linux-unionfs/20201226104618.239739-1-cgxu519@mykernel.net/
Link: https://lore.kernel.org/linux-unionfs/20210210190334.1212210-5-amir73il@gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

72db8211 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: copy up sync/noatime fileattr flags

When a lower file has sync/noatime fileattr flags, the behavior of
overlayfs post copy up is inconsistent.

Immediately after copy up, ovl inode still has the S_SYNC/S_NOATIME
inode flags copied from lower inode, so vfs code still treats the ovl
inode as sync/noatime. After ovl inode evict or mount cycle,
the ovl inode does not have these inode flags anymore.

To fix this inconsistency, try to copy the fileattr flags on copy up
if the upper fs supports the fileattr_set() method.

This gives consistent behavior post copy up regardless of inode eviction
from cache.

We cannot copy up the immutable/append-only inode flags in a similar
manner, because immutable/append-only inodes cannot be linked and because
overlayfs will not be able to set overlay.* xattr on the upper inodes.

Those flags will be addressed by a followup patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

a0c236b1 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: pass ovl_fs to ovl_check_setxattr()

Instead of passing the overlay dentry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

4f911138 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

fs: add generic helper for filling statx attribute flags

The immutable and append-only properties on an inode are published on
the inode's i_flags and enforced by the VFS.

Create a helper to fill the corresponding STATX_ATTR_ flags in the kstat
structure from the inode's i_flags.

Only orange was converted to use this helper.
Other filesystems could use it in the future.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

8d04fbe7 11-Aug-2021 Darrick J. Wong <djwong@kernel.org>

iomap: move loop control code to iter.c

Now that we've moved iomap to the iterator model, rename this file to be
in sync with the functions contained inside of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

fad0a1ab 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: constify iomap_iter_srcmap

The srcmap returned from iomap_iter_srcmap is never modified, so mark
the iomap returned from it const and constify a lot of code that never
modifies the iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

65dd814a 10-Aug-2021 Christoph Hellwig <hch@lst.de>

fsdax: switch the fault handlers to use iomap_iter

Avoid the open coded calls to ->iomap_begin and ->iomap_end and call
iomap_iter instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

c2436190 10-Aug-2021 Shiyang Ruan <ruansy.fnst@fujitsu.com>

fsdax: factor out a dax_fault_actor() helper

The core logic in the two dax page fault functions is similar. So, move
the logic into a common helper function. Also, to facilitate the
addition of new features, such as CoW, switch-case is no longer used to
handle different iomap types.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

55f81639 10-Aug-2021 Shiyang Ruan <ruansy.fnst@fujitsu.com>

fsdax: factor out helpers to simplify the dax fault code

The dax page fault code is too long and a bit difficult to read. And it
is hard to understand when we trying to add new features. Some of the
PTE/PMD codes have similar logic. So, factor out helper functions to
simplify the code.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[hch: minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

b74b1293 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: rework unshare flag

Instead of another internal flags namespace inside of buffered-io.c,
just pass a UNSHARE hint in the main iomap flags field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

1b5c1e36 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: pass an iomap_iter to various buffered I/O helpers

Pass the iomap_iter structure instead of individual parameters to
various internal helpers for buffered I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

57320a01 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: remove iomap_apply

iomap_apply is unused now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: rebase this patch to preserve git history of iomap loop control]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

ca289e0b 10-Aug-2021 Christoph Hellwig <hch@lst.de>

fsdax: switch dax_iomap_rw to use iomap_iter

Switch the dax_iomap_rw implementation to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

3d99a1ce 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_swapfile_activate to use iomap_iter

Switch iomap_swapfile_activate to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

c4740bf1 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_seek_data to use iomap_iter

Rewrite iomap_seek_data to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

40670d18 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_seek_hole to use iomap_iter

Rewrite iomap_seek_hole to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

6d8a1287 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_bmap to use iomap_iter

Rewrite the ->bmap implementation based on iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: restructure the loop to make its behavior a little clearer]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

7892386d 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_fiemap to use iomap_iter

Rewrite the ->fiemap implementation based on iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

a6d3d495 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch __iomap_dio_rw to use iomap_iter

Switch __iomap_dio_rw to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

253564ba 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_page_mkwrite to use iomap_iter

Switch iomap_page_mkwrite to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

2aa3048e 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_zero_range to use iomap_iter

Switch iomap_zero_range to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

8fc274d1 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_file_unshare to use iomap_iter

Switch iomap_file_unshare to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

ce83a025 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch iomap_file_buffered_write to use iomap_iter

Switch iomap_file_buffered_write to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

f6d48000 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: switch readahead and readpage to use iomap_iter

Switch the page cache read functions to use iomap_iter instead of
iomap_apply.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

f4b896c2 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: add the new iomap_iter model

The iomap_iter struct provides a convenient way to package up and
maintain all the arguments to the various mapping and operation
functions. It is operated on using the iomap_iter() function that
is called in loop until the whole range has been processed. Compared
to the existing iomap_apply() function this avoid an indirect call
for each iteration.

For now iomap_iter() calls back into the existing ->iomap_begin and
->iomap_end methods, but in the future this could be further optimized
to avoid indirect calls entirely.

Based on an earlier patch from Matthew Wilcox <willy@infradead.org>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: add to apply.c to preserve git history of iomap loop control]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

740499c7 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: fix the iomap_readpage_actor return value for inline data

The actor should never return a larger value than the length that was
passed in. The current code handles this gracefully, but the opcoming
iter model will be more picky.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

1acd9e9c 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: mark the iomap argument to iomap_read_page_sync const

iomap_read_page_sync never modifies the passed in iomap, so mark
it const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

78c64b00 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: mark the iomap argument to iomap_read_inline_data const

iomap_read_inline_data never modifies the passed in iomap, so mark
it const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

7e4f4b2d 10-Aug-2021 Christoph Hellwig <hch@lst.de>

fsdax: mark the iomap argument to dax_iomap_sector as const

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

6d49cc85 10-Aug-2021 Christoph Hellwig <hch@lst.de>

fs: mark the iomap argument to __block_write_begin_int const

__block_write_begin_int never modifies the passed in iomap, so mark it
const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

1d25d0ae 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: remove the iomap arguments to ->page_{prepare,done}

These aren't actually used by the only instance implementing the methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

d9d381f3 10-Aug-2021 Christoph Hellwig <hch@lst.de>

iomap: fix a trivial comment typo in trace.h

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

b69eea82 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

iomap: pass writeback errors to the mapping

Modern-day mapping_set_error has the ability to squash the usual
negative error code into something appropriate for long-term storage in
a struct address_space -- ENOSPC becomes AS_ENOSPC, and everything else
becomes EIO. iomap squashes /everything/ to EIO, just as XFS did before
that, but this doesn't make sense.

Fix this by making it so that we can pass ENOSPC to userspace when
writeback fails due to space problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

33c0dd78 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: move the CIL workqueue to the CIL

We only use the CIL workqueue in the CIL, so it makes no sense to
hang it off the xfs_mount and have to walk multiple pointers back up
to the mount when we have the CIL structures right there.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

39823d0f 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: CIL work is serialised, not pipelined

Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.

The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL, once the context has been switched the rest of the
context push can run asynchrnously with all other context pushes.

Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, we can run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU bound bottleneck but
not enough to create new contention points or unbound concurrency
issues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

0020a190 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: AIL needs asynchronous CIL forcing

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from AIL when the
checkpoint that marks the items stale is committed to the journal.
This results is relatively few items in the AIL, but those that are
are often pinned as directories items are being removed from are
still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
400.099623: xfsaild: first lsn 0x11002f3600
400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
400.589349: xfsaild: first lsn 0x1100305000
400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
400.950343: xfsaild: first lsn 0x1100317c00
400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
401.142334: xfsaild: first lsn 0x110032e600
401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
401.154328: xfsaild: first lsn 0x1100353000
401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
401.451526: xfsaild: first lsn 0x1100353000
401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and delay between last/first is the sleep time
+ the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions that stale items that never reach the
AIL but consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.

One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

68a74dca 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: order CIL checkpoint start records

Because log recovery depends on strictly ordered start records as
well as strictly ordered commit records.

This is a zero day bug in the way XFS writes pipelined transactions
to the journal which is exposed by fixing the zero day bug that
prevents the CIL from pipelining checkpoints. This re-introduces
explicit concurrent commits back into the on-disk journal and hence
out of order start records.

The XFS journal commit code has never ordered start records and we
have relied on strict commit record ordering for correct recovery
ordering of concurrently written transactions. Unfortunately, root
cause analysis uncovered the fact that log recovery uses the LSN of
the start record for transaction commit processing. Hence, whilst
the commits are processed in strict order by recovery, the LSNs
associated with the commits can be out of order and so recovery may
stamp incorrect LSNs into objects and/or misorder intents in the AIL
for later processing. This can result in log recovery failures
and/or on disk corruption, sometimes silent.

Because this is a long standing log recovery issue, we can't just
fix log recovery and call it good. This still leaves older kernels
susceptible to recovery failures and corruption when replaying a log
from a kernel that pipelines checkpoints. There is also the issue
that in-memory ordering for AIL pushing and data integrity
operations are based on checkpoint start LSNs, and if the start LSN
is incorrect in the journal, it is also incorrect in memory.

Hence there's really only one choice for fixing this zero-day bug:
we need to strictly order checkpoint start records in ascending
sequence order in the log, the same way we already strictly order
commit records.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

caa80090 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state()

Now that we have a mechanism to guarantee that the callbacks
attached to an iclog are owned by the context that attaches them
until they drop their reference to the iclog via
xlog_state_release_iclog(), we can attach callbacks to the iclog at
any time we have an active reference to the iclog.

xlog_state_get_iclog_space() always guarantees that the commit
record will fit in the iclog it returns, so we can move this IO
callback setting to xlog_cil_set_ctx_write_state(), record the
commit iclog in the context and remove the need for the commit iclog
to be returned by xlog_write() altogether.

This, in turn, allows us to move the wakeup for ordered commit
record writes up into xlog_cil_set_ctx_write_state(), too, because
we have been guaranteed that this commit record will be physically
located in the iclog before any waiting commit record at a higher
sequence number will be granted iclog space.

This further cleans up the post commit record write processing in
the CIL push code, especially as xlog_state_release_iclog() will now
clean up the context when shutdown errors occur.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

bf034bc8 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: factor out log write ordering from xlog_cil_push_work()

So we can use it for start record ordering as well as commit record
ordering in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

c45aba40 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: pass a CIL context to xlog_write()

Pass the CIL context to xlog_write() rather than a pointer to a LSN
variable. Only the CIL checkpoint calls to xlog_write() need to know
about the start LSN of the writes, so rework xlog_write to directly
write the LSNs into the CIL context structure.

This removes the commit_lsn variable from xlog_cil_push_work(), so
now we only have to issue the commit record ordering wakeup from
there.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

2ce82b72 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: move xlog_commit_record to xfs_log_cil.c

It is only used by the CIL checkpoints, and is the counterpart to
start record formatting and writing that is already local to
xfs_log_cil.c.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

2562c322 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: log head and tail aren't reliable during shutdown

I'm seeing assert failures from xlog_space_left() after a shutdown
has begun that look like:

XFS (dm-0): log I/O error -5
XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0
XFS (dm-0): Log I/O Error Detected.
XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): xlog_space_left: head behind tail
XFS (dm-0): tail_cycle = 6, tail_bytes = 2706944
XFS (dm-0): GH cycle = 6, GH bytes = 1633867
XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310
------------[ cut here ]------------
Call Trace:
xlog_space_left+0xc3/0x110
xlog_grant_push_threshold+0x3f/0xf0
xlog_grant_push_ail+0x12/0x40
xfs_log_reserve+0xd2/0x270
? __might_sleep+0x4b/0x80
xfs_trans_reserve+0x18b/0x260
.....

There are two things here. Firstly, after a shutdown, the log head
and tail can be out of whack as things abort and release (or don't
release) resources, so checking them for sanity doesn't make much
sense. Secondly, xfs_log_reserve() can race with shutdown and so it
can still fail like this even though it has already checked for a
log shutdown before calling xlog_grant_push_ail().

So, before ASSERT failing in xlog_space_left(), make sure we haven't
already shut down....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

502a01fa 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: don't run shutdown callbacks on active iclogs

When the log is shutdown, it currently walks all the iclogs and runs
callbacks that are attached to the iclogs, regardless of whether the
iclog is queued for IO completion or not. This creates a problem for
contexts attaching callbacks to iclogs in that a racing shutdown can
run the callbacks even before the attaching context has finished
processing the iclog and releasing it for IO submission.

If the callback processing of the iclog frees the structure that is
attached to the iclog, then this leads to an UAF scenario that can
only be protected against by holding the icloglock from the point
callbacks are attached through to the release of the iclog. While we
currently do this, it is not practical or sustainable.

Hence we need to make shutdown processing the responsibility of the
context that holds active references to the iclog. We know that the
contexts attaching callbacks to the iclog must have active
references to the iclog, and that means they must be in either
ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
iclogs in these states -except- when the log is shut down.

xlog_state_do_callback() checks the state of the iclogs while
holdi