History log of /linux-master/fs/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
da4d34b6 22-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"Two fixes for the max workers limit API that was introduced this
series: one fix for an issue with that code, and one fixing a linked
timeout regression in this series"

* tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
io_uring: apply worker limits to previous users
io_uring: fix ltimeout unprep
io_uring: apply max_workers limit to all future users
io-wq: max_worker fixes


5ab2ed0a 22-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
"Syzbot discovered a race in case of reusing the fuse sb (introduced in
this cycle).

Fix it by doing the s_fs_info initialization at the proper place"

* tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up error exits in fuse_fill_super()
fuse: always initialize sb->s_fs_info
fuse: clean up fuse_mount destruction
fuse: get rid of fuse_put_super()
fuse: check s_root when destroying sb


b22fa62a 21-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: apply worker limits to previous users

Another change to the API io-wq worker limitation API added in 5.15,
apply the limit to all prior users that already registered a tctx. It
may be confusing as it's now, in particular the change covers the
following 2 cases:

TASK1 | TASK2
_________________________________________________
ring = create() |
| limit_iowq_workers()
*not limited* |

TASK1 | TASK2
_________________________________________________
ring = create() |
| issue_requests()
limit_iowq_workers() |
| *not limited*

A note on locking, it's safe to traverse ->tctx_list as we hold
->uring_lock, but do that after dropping sqd->lock to avoid possible
problems. It's also safe to access tctx->io_wq there because tasks
kill it only after removing themselves from tctx_list, see
io_uring_cancel_generic() -> io_uring_clean_tctx()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

964d32e5 21-Oct-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: clean up error exits in fuse_fill_super()

Instead of "goto err", return error directly, since there's no error
cleanup to do now.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

80019f11 21-Oct-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: always initialize sb->s_fs_info

Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.

This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list. The initialization
needs to be done in sget_fc() with the sb_lock held.

Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().

After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".

Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c76 ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

c191cd07 21-Oct-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: clean up fuse_mount destruction

1. call fuse_mount_destroy() for open coded variants

2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)

3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

a27c061a 21-Oct-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: get rid of fuse_put_super()

The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb. This is called from kill_***_super(), which is
called from ->kill_sb instances.

Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.

This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances. A
follup patch will clean up the error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

d534d31d 21-Oct-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: check s_root when destroying sb

Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.

This change will allow cleanup of sb setup error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

2f111a6f 20-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"Two important filesystem fixes, marked for stable.

The blocklisted superblocks issue was particularly annoying because
for unexperienced users it essentially exacted a reboot to establish a
new functional mount in that scenario"

* tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
ceph: fix handling of "meta" errors
ceph: skip existing superblocks that are blocklisted or shut down when mounting


4ea672ab 20-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix ltimeout unprep

io_unprep_linked_timeout() is broken, first it needs to return back
REQ_F_ARM_LTIMEOUT, so the linked timeout is enqueued and disarmed. But
now we refcounted it, and linked timeouts may get not executed at all,
leaking a request.

Just kill the unprep optimisation.

Fixes: 906c6caaf586 ("io_uring: optimise io_prep_linked_timeout()")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e139a1ec 19-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: apply max_workers limit to all future users

Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task
that issued it, it's unexpected for users. If one task creates a ring,
limits workers and then passes it to another task the limit won't be
applied to the other task.

Another pitfall is that a task should either create a ring or submit at
least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all,
furher complicating the picture.

Change the API, save the limits and apply to all future users. Note, it
should be done first before giving away the ring or submitting new
requests otherwise the result is not guaranteed.

Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bc369921 19-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

io-wq: max_worker fixes

First, fix nr_workers checks against max_workers, with max_worker
registration, it may pretty easily happen that nr_workers > max_workers.

Also, synchronise writing to acct->max_worker with wqe->lock. It's not
an actual problem, but as we don't care about io_wqe_create_worker(),
it's better than WRITE_ONCE()/READ_ONCE().

Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1bd85aa6 07-Oct-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix handling of "meta" errors

Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

98d0a6fb 30-Sep-2021 Jeff Layton <jlayton@kernel.org>

ceph: skip existing superblocks that are blocklisted or shut down when mounting

Currently when mounting, we may end up finding an existing superblock
that corresponds to a blocklisted MDS client. This means that the new
mount ends up being unusable.

If we've found an existing superblock with a client that is already
blocklisted, and the client is not configured to recover on its own,
fail the match. Ditto if the superblock has been forcibly unmounted.

While we're in here, also rename "other" to the more conventional "fsc".

Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

032146cd 18-Oct-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

vfs: check fd has read access in kernel_read_file_from_fd()

If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():

if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))

This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.

Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b15fa922 18-Oct-2021 Valentin Vidic <vvidic@valentin-vidic.from.hr>

ocfs2: mount fails with buffer overflow in strlen

Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an
ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
trace below. Problem seems to be that strings for cluster stack and
cluster name are not guaranteed to be null terminated in the disk
representation, while strlcpy assumes that the source string is always
null terminated. This causes a read outside of the source string
triggering the buffer overflow detection.

detected buffer overflow in strlen
------------[ cut here ]------------
kernel BUG at lib/string.c:1149!
invalid opcode: 0000 [#1] SMP PTI
CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
Debian 5.14.6-2
RIP: 0010:fortify_panic+0xf/0x11
...
Call Trace:
ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
ocfs2_fill_super+0x359/0x19b0 [ocfs2]
mount_bdev+0x185/0x1b0
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
path_mount+0x454/0xa20
__x64_sys_mount+0x103/0x140
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5314454e 18-Oct-2021 Jan Kara <jack@suse.cz>

ocfs2: fix data corruption after conversion from inline format

Commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in
block_write_full_page()") uncovered a latent bug in ocfs2 conversion
from inline inode format to a normal inode format.

The code in ocfs2_convert_inline_data_to_extents() attempts to zero out
the whole cluster allocated for file data by grabbing, zeroing, and
dirtying all pages covering this cluster. However these pages are
beyond i_size, thus writeback code generally ignores these dirty pages
and no blocks were ever actually zeroed on the disk.

This oversight was fixed by commit 693c241a5f6a ("ocfs2: No need to zero
pages past i_size.") for standard ocfs2 write path, inline conversion
path was apparently forgotten; the commit log also has a reasoning why
the zeroing actually is not needed.

After commit 6dbf7bb55598, things became worse as writeback code stopped
invalidating buffers on pages beyond i_size and thus these pages end up
with clean PageDirty bit but with buffers attached to these pages being
still dirty. So when a file is converted from inline format, then
writeback triggers, and then the file is grown so that these pages
become valid, the invalid dirtiness state is preserved,
mark_buffer_dirty() does nothing on these pages (buffers are already
dirty) but page is never written back because it is clean. So data
written to these pages is lost once pages are reclaimed.

Simple reproducer for the problem is:

xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
-c "pwrite 4000 2000" ocfs2_file

After unmounting and mounting the fs again, you can observe that end of
'ocfs2_file' has lost its contents.

Fix the problem by not doing the pointless zeroing during conversion
from inline format similarly as in the standard write path.

[akpm@linux-foundation.org: fix whitespace, per Joseph]

Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
Fixes: 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: "Markov, Andrey" <Markov.Andrey@Dell.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cb185d5f 18-Oct-2021 Nadav Amit <namit@vmware.com>

userfaultfd: fix a race between writeprotect and exit_mmap()

A race is possible when a process exits, its VMAs are removed by
exit_mmap() and at the same time userfaultfd_writeprotect() is called.

The race was detected by KASAN on a development kernel, but it appears
to be possible on vanilla kernels as well.

Use mmget_not_zero() to prevent the race as done in other userfaultfd
operations.

Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Signed-off-by: Nadav Amit <namit@vmware.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cc0af0a9 17-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
"Just a single fix for a wrong condition for grabbing a lock, a
regression in this merge window"

* tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
io_uring: fix wrong condition to grab uring lock


cf52ad5f 17-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
"Here are some small driver core fixes for 5.15-rc6, all of which have
been in linux-next for a while with no reported issues.

They include:

- kernfs negative dentry bugfix

- simple pm bus fixes to resolve reported issues"

* tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: bus: Delete CONFIG_SIMPLE_PM_BUS
drivers: bus: simple-pm-bus: Add support for probing simple bus only devices
driver core: Reject pointless SYNC_STATE_ONLY device links
kernfs: don't create a negative dentry if inactive node exists


86a44e90 15-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ntfs3_for_5.15' of git://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 fixes from Konstantin Komarov:
"Use the new api for mounting as requested by Christoph.

Also fixed:

- some memory leaks and panic

- xfstests (tested on x86_64) generic/016 generic/021 generic/022
generic/041 generic/274 generic/423

- some typos, wrong returned error codes, dead code, etc"

* tag 'ntfs3_for_5.15' of git://github.com/Paragon-Software-Group/linux-ntfs3: (70 commits)
fs/ntfs3: Check for NULL pointers in ni_try_remove_attr_list
fs/ntfs3: Refactor ntfs_read_mft
fs/ntfs3: Refactor ni_parse_reparse
fs/ntfs3: Refactor ntfs_create_inode
fs/ntfs3: Refactor ntfs_readlink_hlp
fs/ntfs3: Rework ntfs_utf16_to_nls
fs/ntfs3: Fix memory leak if fill_super failed
fs/ntfs3: Keep prealloc for all types of files
fs/ntfs3: Remove unnecessary functions
fs/ntfs3: Forbid FALLOC_FL_PUNCH_HOLE for normal files
fs/ntfs3: Refactoring of ntfs_set_ea
fs/ntfs3: Remove locked argument in ntfs_set_ea
fs/ntfs3: Use available posix_acl_release instead of ntfs_posix_acl_release
fs/ntfs3: Check for NULL if ATTR_EA_INFO is incorrect
fs/ntfs3: Refactoring of ntfs_init_from_boot
fs/ntfs3: Reject mount if boot's cluster size < media sector size
fs/ntfs3: Refactoring lock in ntfs_init_acl
fs/ntfs3: Change posix_acl_equiv_mode to posix_acl_update_mode
fs/ntfs3: Pass flags to ntfs_set_ea in ntfs_set_acl_ex
fs/ntfs3: Refactor ntfs_get_acl_ex for better readability
...


14cfbb7a 14-Oct-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix wrong condition to grab uring lock

Grab uring lock when we are in io-worker rather than in the original
or system-wq context since we already hold it in these two situation.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Fixes: b66ceaf324b3 ("io_uring: move iopoll reissue into regular IO path")
Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8607954c 11-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Check for NULL pointers in ni_try_remove_attr_list

Check for potential NULL pointers.
Print error message if found.
Thread, that leads to this commit:
https://lore.kernel.org/ntfs3/227c13e3-5a22-0cba-41eb-fcaf41940711@paragon-software.com/

Reported-by: Mohammad Rasim <mohammad.rasim96@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

1986c10a 11-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few more error handling fixes, stemming from code inspection, error
injection or fuzzing"

* tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix abort logic in btrfs_replace_file_extents
btrfs: check for error when looking up inode during dir entry replay
btrfs: unify lookup return value when dir entry is missing
btrfs: deal with errors when adding inode reference during log replay
btrfs: deal with errors when replaying dir entry during log replay
btrfs: deal with errors when checking if a dir entry exists during log replay
btrfs: update refs for any root except tree log roots
btrfs: unlock newly allocated extent buffer after error


22b05f1a 05-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactor ntfs_read_mft

Don't save size of attribute reparse point as size of symlink.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

cd4c76ff 05-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactor ni_parse_reparse

Change argument from void* to struct REPARSE_DATA_BUFFER*
We copy data to buffer, so we can read it later in ntfs_read_mft.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

14a98119 04-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactor ntfs_create_inode

Set size for symlink, so we don't need to calculate it on the fly.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

4dbe8e44 04-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactor ntfs_readlink_hlp

Rename some variables.
Returned err by default is EINVAL.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

2c690788 04-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Rework ntfs_utf16_to_nls

Now ntfs_utf16_to_nls takes length as one of arguments.
If length of symlink > 255, then we tried to convert
length of symlink +- some random number.
Now 255 symbols limit was removed.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9b75450d 28-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Fix memory leak if fill_super failed

In ntfs_init_fs_context we allocate memory in fc->s_fs_info.
In case of failed mount we must free it in ntfs_fill_super.
We can't do it in ntfs_fs_free, because ntfs_fs_free called
with fc->s_fs_info == NULL.
fc->s_fs_info became NULL in sget_fc.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

ce46ae0c 01-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Keep prealloc for all types of files

Before we haven't kept prealloc for sparse files because we thought that
it will speed up create / write operations.
It lead to situation, when user reserved some space for sparse file,
filled volume, and wasn't able to write in reserved file.
With this commit we keep prealloc.
Now xfstest generic/274 pass.
Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations")

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

c75de845 09-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc4-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:
"Six fixes for the ksmbd kernel server, including two additional
overflow checks, a fix for oops, and some cleanup (e.g. remove dead
code for less secure dialects that has been removed)"

* tag '5.15-rc4-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix oops from fuse driver
ksmbd: fix version mismatch with out of tree
ksmbd: use buf_data_size instead of recalculation in smb3_decrypt_req()
ksmbd: remove the leftover of smb2.0 dialect support
ksmbd: check strictly data area in ksmbd_smb2_check_message()
ksmbd: add the check to vaildate if stream protocol length exceeds maximum value


1da38549 07-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
"Bug fixes for NFSD error handling paths"

* tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Keep existing listeners on portlist error
SUNRPC: fix sign error causing rpcsec_gss drops
nfsd: Fix a warning for nfsd_file_close_inode
nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero
nfsd: fix error handling of register_pernet_subsys() in init_nfsd()


4afb912f 05-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: fix abort logic in btrfs_replace_file_extents

Error injection testing uncovered a case where we'd end up with a
corrupt file system with a missing extent in the middle of a file. This
occurs because the if statement to decide if we should abort is wrong.

The only way we would abort in this case is if we got a ret !=
-EOPNOTSUPP and we called from the file clone code. However the
prealloc code uses this path too. Instead we need to abort if there is
an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
if we came from the clone file code.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cfd31269 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: check for error when looking up inode during dir entry replay

At replay_one_name(), we are treating any error from btrfs_lookup_inode()
as if the inode does not exists. Fix this by checking for an error and
returning it to the caller.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8dcbc261 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: unify lookup return value when dir entry is missing

btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
entries and both are used during log replay or when updating a log tree
during an unlink.

However when the dir item does not exists, btrfs_lookup_dir_item() returns
NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
the dir item exists but there is no matching entry for a given name or
index, both return NULL. This makes the call sites during log replay to
be more verbose than necessary and it makes it easy to miss this slight
difference. Since we don't need to distinguish between those two cases,
make btrfs_lookup_dir_index_item() always return NULL when there is no
matching directory entry - either because there isn't any dir entry or
because there is one but it does not match the given name and index.

Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
'index' since it is supposed to match an index number, and the name
'objectid' is not very good because it can easily be confused with an
inode number (like the inode number a dir entry points to).

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

52db7779 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: deal with errors when adding inode reference during log replay

At __inode_add_ref(), we treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.

So fix that and return the error to the caller when it is not -ENOENT.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e15ac641 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: deal with errors when replaying dir entry during log replay

At replay_one_one(), we are treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.

So fix that and return the error to the caller when it is not -ENOENT.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

77a5b9e3 01-Oct-2021 Filipe Manana <fdmanana@suse.com>

btrfs: deal with errors when checking if a dir entry exists during log replay

Currently inode_in_dir() ignores errors returned from
btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating
any errors as if the directory entry does not exists in the fs/subvolume
tree, which is obviously not correct, as we can get errors such as -EIO
when reading extent buffers while searching the fs/subvolume's tree.

Fix that by making inode_in_dir() return the errors and making its only
caller, add_inode_ref(), deal with returned errors as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d175209b 01-Oct-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: update refs for any root except tree log roots

I hit a stuck relocation on btrfs/061 during my overnight testing. This
turned out to be because we had left over extent entries in our extent
root for a data reloc inode that no longer existed. This happened
because in btrfs_drop_extents() we only update refs if we have SHAREABLE
set or we are the tree_root. This regression was introduced by
aeb935a45581 ("btrfs: don't set SHAREABLE flag for data reloc tree")
where we stopped setting SHAREABLE for the data reloc tree.

The problem here is we actually do want to update extent references for
data extents in the data reloc tree, in fact we only don't want to
update extent references if the file extents are in the log tree.
Update this check to only skip updating references in the case of the
log tree.

This is relatively rare, because you have to be running scrub at the
same time, which is what btrfs/061 does. The data reloc inode has its
extents pre-allocated, and then we copy the extent into the
pre-allocated chunks. We theoretically should never be calling
btrfs_drop_extents() on a data reloc inode. The exception of course is
with scrub, if our pre-allocated extent falls inside of the block group
we are scrubbing, then the block group will be marked read only and we
will be forced to cow that extent. This means we will call
btrfs_drop_extents() on that range when we COW that file extent.

This isn't really problematic if we do this, the data reloc inode
requires that our extent lengths match exactly with the extent we are
copying, thankfully we validate the extent is correct with
get_new_location(), so if we happen to COW only part of the extent we
won't link it in when we do the relocation, so we are safe from any
other shenanigans that arise because of this interaction with scrub.

Fixes: aeb935a45581 ("btrfs: don't set SHAREABLE flag for data reloc tree")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

19ea40dd 14-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: unlock newly allocated extent buffer after error

[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

BTRFS info (device loop0): enabling ssd optimizations
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
fail_dump lib/fault-inject.c:52 [inline]
should_fail+0x13c/0x160 lib/fault-inject.c:146
should_failslab+0x5/0x10 mm/slab_common.c:1328
slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
slab_alloc_node mm/slub.c:3120 [inline]
slab_alloc mm/slub.c:3214 [inline]
kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
__btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
lookup_open+0x660/0x780 fs/namei.c:3282
open_last_lookups fs/namei.c:3352 [inline]
path_openat+0x465/0xe20 fs/namei.c:3557
do_filp_open+0xe3/0x170 fs/namei.c:3588
do_sys_openat2+0x357/0x4a0 fs/open.c:1200
do_sys_open+0x87/0xd0 fs/open.c:1216
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x46ae99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

================================================
WARNING: lock held when returning to user space!
5.15.0-rc1 #16 Not tainted
------------------------------------------------
syz-executor/7579 is leaving the kernel with locks still held!
1 lock held by syz-executor/7579:
#0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
__btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7041503d 07-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfslib, cachefiles and afs fixes from David Howells:

- Fix another couple of oopses in cachefiles tracing stemming from the
possibility of passing in a NULL object pointer

- Fix netfs_clear_unread() to set READ on the iov_iter so that source
it is passed to doesn't do the wrong thing (some drivers look at the
flag on iov_iter rather than other available information to determine
the direction)

- Fix afs_launder_page() to write back at the correct file position on
the server so as not to corrupt data

* tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix afs_launder_page() to set correct start file position
netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()
cachefiles: Fix oops with cachefiles_cull() due to NULL object


64e78755 02-Oct-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix oops from fuse driver

Marios reported kernel oops from fuse driver when ksmbd call
mark_inode_dirty(). This patch directly update ->i_ctime after removing
mark_inode_ditry() and notify_change will put inode to dirty list.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Reported-by: Marios Makassikis <mmakassikis@freebox.fr>
Tested-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

2db72604 30-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix version mismatch with out of tree

Fix version mismatch with out of tree, This updated version will be
matched with ksmbd-tools.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

c7705eec 04-Oct-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: use buf_data_size instead of recalculation in smb3_decrypt_req()

Tom suggested to use buf_data_size that is already calculated, to verify
these offsets.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Suggested-by: Tom Talpey <tom@talpey.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

51a13873 28-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: remove the leftover of smb2.0 dialect support

Although ksmbd doesn't send SMB2.0 support in supported dialect list of smb
negotiate response, There is the leftover of smb2.0 dialect.
This patch remove it not to support SMB2.0 in ksmbd.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

c2e99d47 26-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: check strictly data area in ksmbd_smb2_check_message()

When invalid data offset and data length in request,
ksmbd_smb2_check_message check strictly and doesn't allow to process such
requests.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

c2010694 06-Oct-2021 Benjamin Coddington <bcodding@redhat.com>

NFSD: Keep existing listeners on portlist error

If nfsd has existing listening sockets without any processes, then an error
returned from svc_create_xprt() for an additional transport will remove
those existing listeners. We're seeing this in practice when userspace
attempts to create rpcrdma transports without having the rpcrdma modules
present before creating nfsd kernel processes. Fix this by checking for
existing sockets before calling nfsd_destroy().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

36399990 23-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add the check to vaildate if stream protocol length exceeds maximum value

This patch add MAX_STREAM_PROT_LEN macro and check if stream protocol
length exceeds maximum value. opencode pdu size check in
ksmbd_pdu_size_has_room().

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

60a94835 05-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'warning-fixes-20211005' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc fs warning fixes from David Howells:
"The first four patches fix kerneldoc warnings in fscache, afs, 9p and
nfs - they're mostly just comment changes, though there's one place in
9p where a comment got detached from the function it was attached to
(v9fs_fid_add) and has to switch places with a function that got
inserted between (__add_fid).

The patch on the end removes an unused symbol in fscache"

* tag 'warning-fixes-20211005' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Remove an unused static variable
fscache: Fix some kerneldoc warnings shown up by W=1
9p: Fix a bunch of kerneldoc warnings shown up by W=1
afs: Fix kerneldoc warning shown up by W=1
nfs: Fix kerneldoc warning shown up by W=1


95dd8b2c 05-Oct-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Remove unnecessary functions

We don't need ntfs_xattr_get_acl and ntfs_xattr_set_acl.
There are ntfs_get_acl_ex and ntfs_set_acl_ex.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

8241fffa 28-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Forbid FALLOC_FL_PUNCH_HOLE for normal files

FALLOC_FL_PUNCH_HOLE isn't allowed with normal files.
Filesystem must remember info about hole, but for normal file
we can only zero it and forget.

Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation")
Now xfstests generic/016 generic/021 generic/022 pass.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

cff32466 24-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactoring of ntfs_set_ea

Make code more readable.
Don't try to read zero bytes.
Add warning when size of exteneded attribute exceeds limit.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d81e06be 24-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Remove locked argument in ntfs_set_ea

We always need to lock now, because locks became smaller
(see d562e901f25d
"fs/ntfs3: Move ni_lock_dir and ni_unlock into ntfs_create_inode").

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b1e0c55a 24-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Use available posix_acl_release instead of ntfs_posix_acl_release

We don't need to maintain ntfs_posix_acl_release.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

5c052248 12-Aug-2021 David Howells <dhowells@redhat.com>

afs: Fix afs_launder_page() to set correct start file position

Fix afs_launder_page() to set the starting position of the StoreData RPC at
the offset into the page at which the modified data starts instead of at
the beginning of the page (the iov_iter is correctly offset).

The offset got lost during the conversion to passing an iov_iter into
afs_store_data().

Changes:
ver #2:
- Use page_offset() rather than manually calculating it[1].

Fixes: bd80d8a80e12 ("afs: Use ITER_XARRAY for writing")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YST/0e92OdSH0zjg@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/162880783179.3421678.7795105718190440134.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162937512409.1449272.18441473411207824084.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162981148752.1901565.3663780601682206026.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163005741670.2472992.2073548908229887941.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163221839087.3143591.14278359695763025231.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163292980654.4004896.7134735179887998551.stgit@warthog.procyon.org.uk/ # v2

330de47d 26-Jul-2021 David Howells <dhowells@redhat.com>

netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()

Fix netfs_clear_unread() to pass READ to iov_iter_xarray() instead of WRITE
(the flag is about the operation accessing the buffer, not what sort of
access it is doing to the buffer).

Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/162729351325.813557.9242842205308443901.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162886603464.3940407.3790841170414793899.stgit@warthog.procyon.org.uk
Link: https://lore.kernel.org/r/163239074602.1243337.14154704004485867017.stgit@warthog.procyon.org.uk

ef31499a 20-Sep-2021 David Howells <dhowells@redhat.com>

fscache: Remove an unused static variable

The fscache object CREATE_OBJECT work state isn't ever referred to, so
remove it and avoid the unused variable warning caused by W=1.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2

d9e3f822 04-Oct-2021 David Howells <dhowells@redhat.com>

fscache: Fix some kerneldoc warnings shown up by W=1

Fix some kerneldoc warnings in the fscache driver that are shown up by W=1.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Mauro Carvalho Chehab <mchehab@kernel.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2

bc868036 04-Oct-2021 David Howells <dhowells@redhat.com>

9p: Fix a bunch of kerneldoc warnings shown up by W=1

Fix a bunch of kerneldoc warnings shown up by W=1 in the 9p filesystem:

(1) Add/remove/fix kerneldoc parameters descriptions.

(2) Move __add_fid() from between v9fs_fid_add() and its comment.

(3) 9p's caches_show() doesn't really make sense as an API function, so
remove the kerneldoc annotation. It's also not prefixed with 'v9fs_'.
Also remove the kerneldoc markers from the 9p fscache wrappers.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Mauro Carvalho Chehab <mchehab@kernel.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2

dcb442b1 04-Oct-2021 David Howells <dhowells@redhat.com>

afs: Fix kerneldoc warning shown up by W=1

Fix a kerneldoc warning in afs due to a partially documented internal
function by removing the kerneldoc marker.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2

c0b27c48 04-Oct-2021 David Howells <dhowells@redhat.com>

nfs: Fix kerneldoc warning shown up by W=1

Fix a kerneldoc warning in nfs due to documentation for a parameter that
isn't present.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: Mauro Carvalho Chehab <mchehab@kernel.org>
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2

b60be028 04-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ovl-fixes-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs, both of them corner cases not affecting most users"

* tag 'ovl-fixes-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix IOCB_DIRECT if underlying fs doesn't support direct IO
ovl: fix missing negative dentry check in ovl_rename()


410d591a 03-Oct-2021 Ian Kent <raven@themaw.net>

kernfs: don't create a negative dentry if inactive node exists

It's been reported that doing stress test for module insertion and
removal can result in an ENOENT from libkmod for a valid module.

In kernfs_iop_lookup() a negative dentry is created if there's no kernfs
node associated with the dentry or the node is inactive.

But inactive kernfs nodes are meant to be invisible to the VFS and
creating a negative dentry for these can have unexpected side effects
when the node transitions to an active state.

The point of creating negative dentries is to avoid the expensive
alloc/free cycle that occurs if there are frequent lookups for kernfs
attributes that don't exist. So kernfs nodes that are not yet active
should not result in a negative dentry being created so when they
transition to an active state VFS lookups can create an associated
dentry is a natural way.

It's also been reported that https://github.com/osandov/blktests.git
test block/001 hangs during the test. It was suggested that recent
changes to blktests might have caused it but applying this patch
resolved the problem without change to blktests.

Fixes: c7e7c04274b1 ("kernfs: use VFS negative dentry caching")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
ACKed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/163330943316.19450.15056895533949392922.stgit@mickey.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

9b2f72cc 28-Sep-2021 Chen Jingwen <chenjingwen6@huawei.com>

elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings

In commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.

Unfortunately, this will cause kernel to fail to start with:

1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)

The reason is that the elf interpreter (ld.so) has overlapping segments.

readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000

The reason for this problem is the same as described in commit
ad55eac74f20 ("elf: enforce MAP_FIXED on overlaying elf segments").

Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.

Fixes: 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ca3cef46 03-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Fix a number of ext4 bugs in fast_commit, inline data, and delayed
allocation.

Also fix error handling code paths in ext4_dx_readdir() and
ext4_fill_super().

Finally, avoid a grabbing a journal head in the delayed allocation
write in the common cases where we are overwriting a pre-existing
block or appending to an inode"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: recheck buffer uptodate bit under buffer lock
ext4: fix potential infinite loop in ext4_dx_readdir()
ext4: flush s_error_work before journal destroy in ext4_fill_super
ext4: fix loff_t overflow in ext4_max_bitmap_size()
ext4: fix reserved space counter leakage
ext4: limit the number of blocks in one ADD_RANGE TLV
ext4: enforce buffer head state assertion in ext4_da_map_blocks
ext4: remove extent cache entries when truncating inline data
ext4: drop unnecessary journal handle in delalloc write
ext4: factor out write end code of inline file
ext4: correct the error path of ext4_write_inline_data_end()
ext4: check and update i_disksize properly
ext4: add error checking to ext4_ext_replay_set_iblocks()


84928ce3 03-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
"Here are some driver core and kernfs fixes for reported issues for
5.15-rc4. These fixes include:

- kernfs positive dentry bugfix

- debugfs_create_file_size error path fix

- cpumask sysfs file bugfix to preserve the user/kernel abi (has been
reported multiple times.)

- devlink fixes for mdiobus devices as reported by the subsystem
maintainers.

Also included in here are some devlink debugging changes to make it
easier for people to report problems when asked. They have already
helped with the mdiobus and other subsystems reporting issues.

All of these have been linux-next for a while with no reported issues"

* tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: also call kernfs_set_rev() for positive dentry
driver core: Add debug logs when fwnode links are added/deleted
driver core: Create __fwnode_link_del() helper function
driver core: Set deferred probe reason when deferred by driver core
net: mdiobus: Set FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD for mdiobus parents
driver core: fw_devlink: Add support for FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD
driver core: fw_devlink: Improve handling of cyclic dependencies
cpumask: Omit terminating null byte in cpumap_print_{list,bitmask}_to_buf
debugfs: debugfs_create_file_size(): use IS_ERR to check for error


e25ca045 02-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:
"Eleven fixes for the ksmbd kernel server, mostly security related:

- an important fix for disabling weak NTLMv1 authentication

- seven security (improved buffer overflow checks) fixes

- fix for wrong infolevel struct used in some getattr/setattr paths

- two small documentation fixes"

* tag '5.15-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: missing check for NULL in convert_to_nt_pathname()
ksmbd: fix transform header validation
ksmbd: add buffer validation for SMB2_CREATE_CONTEXT
ksmbd: add validation in smb2 negotiate
ksmbd: add request buffer validation in smb2_set_info
ksmbd: use correct basic info level in set_file_basic_info()
ksmbd: remove NTLMv1 authentication
ksmbd: fix documentation for 2 functions
MAINTAINERS: rename cifs_common to smbfs_common in cifs and ksmbd entry
ksmbd: fix invalid request buffer access in compound
ksmbd: remove RFC1002 check in smb2 request


65893b49 02-Oct-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-10-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"Two fixes in here:

- The signal issue that was discussed start of this week (me).

- Kill dead fasync support in io_uring. Looks like it was broken
since io_uring was initially merged, and given that nobody has ever
complained about it, let's just kill it (Pavel)"

* tag 'io_uring-5.15-2021-10-01' of git://git.kernel.dk/linux-block:
io_uring: kill fasync
io-wq: exclusively gate signal based exit on get_signal() return


3f008385 01-Oct-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill fasync

We have never supported fasync properly, it would only fire when there
is something polling io_uring making it useless. The original support came
in through the initial io_uring merge for 5.1. Since it's broken and
nobody has reported it, get rid of the fasync bits.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19598141 30-Sep-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

nfsd: Fix a warning for nfsd_file_close_inode

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f2c77973 10-Sep-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: recheck buffer uptodate bit under buffer lock

Commit 8e33fadf945a ("ext4: remove an unnecessary if statement in
__ext4_get_inode_loc()") forget to recheck buffer's uptodate bit again
under buffer lock, which may overwrite the buffer if someone else have
already brought it uptodate and changed it.

Fixes: 8e33fadf945a ("ext4: remove an unnecessary if statement in __ext4_get_inode_loc()")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210910080316.70421-1-yi.zhang@huawei.com

42cb4474 14-Sep-2021 yangerkun <yangerkun@huawei.com>

ext4: fix potential infinite loop in ext4_dx_readdir()

When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an
infinite loop since if info->last_pos != ctx->pos this will reset the
directory scan and reread the failing entry. For example:

1. a dx_dir which has 3 block, block 0 as dx_root block, block 1/2 as
leaf block which own the ext4_dir_entry_2
2. block 1 read ok and call_filldir which will fill the dirent and update
the ctx->pos
3. block 2 read fail, but we has already fill some dirent, so we will
return back to userspace will a positive return val(see ksys_getdents64)
4. the second ext4_dx_readdir will reset the world since info->last_pos
!= ctx->pos, and will also init the curr_hash which pos to block 1
5. So we will read block1 too, and once block2 still read fail, we can
only fill one dirent because the hash of the entry in block1(besides
the last one) won't greater than curr_hash
6. this time, we forget update last_pos too since the read for block2
will fail, and since we has got the one entry, ksys_getdents64 can
return success
7. Latter we will trapped in a loop with step 4~6

Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com

bb9464e0 24-Sep-2021 yangerkun <yangerkun@huawei.com>

ext4: flush s_error_work before journal destroy in ext4_fill_super

The error path in ext4_fill_super forget to flush s_error_work before
journal destroy, and it may trigger the follow bug since
flush_stashed_error_work can run concurrently with journal destroy
without any protection for sbi->s_journal.

[32031.740193] EXT4-fs (loop66): get root inode failed
[32031.740484] EXT4-fs (loop66): mount failed
[32031.759805] ------------[ cut here ]------------
[32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
[32031.760075] invalid opcode: 0000 [#1] SMP PTI
[32031.760336] CPU: 5 PID: 1029268 Comm: kworker/5:1 Kdump: loaded
4.18.0
[32031.765112] Call Trace:
[32031.765375] ? __switch_to_asm+0x35/0x70
[32031.765635] ? __switch_to_asm+0x41/0x70
[32031.765893] ? __switch_to_asm+0x35/0x70
[32031.766148] ? __switch_to_asm+0x41/0x70
[32031.766405] ? _cond_resched+0x15/0x40
[32031.766665] jbd2__journal_start+0xf1/0x1f0 [jbd2]
[32031.766934] jbd2_journal_start+0x19/0x20 [jbd2]
[32031.767218] flush_stashed_error_work+0x30/0x90 [ext4]
[32031.767487] process_one_work+0x195/0x390
[32031.767747] worker_thread+0x30/0x390
[32031.768007] ? process_one_work+0x390/0x390
[32031.768265] kthread+0x10d/0x130
[32031.768521] ? kthread_flush_work_fn+0x10/0x10
[32031.768778] ret_from_fork+0x35/0x40

static int start_this_handle(...)
BUG_ON(journal->j_flags & JBD2_UNMOUNT); <---- Trigger this

Besides, after we enable fast commit, ext4_fc_replay can add work to
s_error_work but return success, so the latter journal destroy in
ext4_load_journal can trigger this problem too.

Fix this problem with two steps:
1. Call ext4_commit_super directly in ext4_handle_error for the case
that called from ext4_fc_replay
2. Since it's hard to pair the init and flush for s_error_work, we'd
better add a extras flush_work before journal destroy in
ext4_fill_super

Besides, this patch will call ext4_commit_super in ext4_handle_error for
any nojournal case too. But it seems safe since the reason we call
schedule_work was that we should save error info to sb through journal
if available. Conversely, for the nojournal case, it seems useless delay
commit superblock to s_error_work.

Fixes: c92dc856848f ("ext4: defer saving error info from atomic context")
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210924093917.1953239-1-yangerkun@huawei.com

75ca6ad4 04-Jun-2021 Ritesh Harjani <riteshh@linux.ibm.com>

ext4: fix loff_t overflow in ext4_max_bitmap_size()

We should use unsigned long long rather than loff_t to avoid
overflow in ext4_max_bitmap_size() for comparison before returning.
w/o this patch sbi->s_bitmap_maxbytes was becoming a negative
value due to overflow of upper_limit (with has_huge_files as true)

Below is a quick test to trigger it on a 64KB pagesize system.

sudo mkfs.ext4 -b 65536 -O ^has_extents,^64bit /dev/loop2
sudo mount /dev/loop2 /mnt
sudo echo "hello" > /mnt/hello -> This will error out with
"echo: write error: File too large"

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/594f409e2c543e90fd836b78188dfa5c575065ba.1622867594.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

6fed8395 23-Aug-2021 Jeffle Xu <jefflexu@linux.alibaba.com>

ext4: fix reserved space counter leakage

When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should. One effect
of this bug is that s_dirtyclusters_counter is not decremented and
remains incorrectly elevated until the file system has been unmounted.
This can result in premature ENOSPC returns and apparent loss of free
space.

Another effect of this bug is that
/sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
after syncfs has been executed on the filesystem.

Besides, add check for s_dirtyclusters_counter when inode is going to be
evicted and freed. s_dirtyclusters_counter can still keep non-zero until
inode is written back in .evict_inode(), and thus the check is delayed
to .destroy_inode().

Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
Cc: stable@kernel.org
Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com

a2c2f082 19-Aug-2021 Hou Tao <houtao1@huawei.com>

ext4: limit the number of blocks in one ADD_RANGE TLV

Now EXT4_FC_TAG_ADD_RANGE uses ext4_extent to track the
newly-added blocks, but the limit on the max value of
ee_len field is ignored, and it can lead to BUG_ON as
shown below when running command "fallocate -l 128M file"
on a fast_commit-enabled fs:

kernel BUG at fs/ext4/ext4_extents.h:199!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 624 Comm: fallocate Not tainted 5.14.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:ext4_fc_write_inode_data+0x1f3/0x200
Call Trace:
? ext4_fc_write_inode+0xf2/0x150
ext4_fc_commit+0x93b/0xa00
? ext4_fallocate+0x1ad/0x10d0
ext4_sync_file+0x157/0x340
? ext4_sync_file+0x157/0x340
vfs_fsync_range+0x49/0x80
do_fsync+0x3d/0x70
__x64_sys_fsync+0x14/0x20
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

Simply fixing it by limiting the number of blocks
in one EXT4_FC_TAG_ADD_RANGE TLV.

Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210820044505.474318-1-houtao1@huawei.com

87ffb310 30-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

ksmbd: missing check for NULL in convert_to_nt_pathname()

The kmalloc() does not have a NULL check. This code can be re-written
slightly cleaner to just use the kstrdup().

Fixes: 265fd1991c1d ("ksmbd: use LOOKUP_BENEATH to prevent the out of share access")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

f2e717d6 30-Sep-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero

RFC3530 notes that the 'dircount' field may be zero, in which case the
recommendation is to ignore it, and only enforce the 'maxcount' field.
In RFC5661, this recommendation to ignore a zero valued field becomes a
requirement.

Fixes: aee377644146 ("nfsd4: fix rd_dircount enforcement")
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

35afb70d 29-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Check for NULL if ATTR_EA_INFO is incorrect

This can be reason for reported panic
https://lore.kernel.org/ntfs3/f9de5807-2311-7374-afb0-bc5dffb522c0@gmail.com/
Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation")

Reported-by: Mohammad Rasim <mohammad.rasim96@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

dbf59e2a 28-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactoring of ntfs_init_from_boot

Remove ntfs_sb_info members sector_size and sector_bits.
Print details why mount failed.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

09f7c338 28-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Reject mount if boot's cluster size < media sector size

If we continue to work in this case, then we can corrupt fs.
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block").

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

1d625050 21-Aug-2021 Patrick Ho <Patrick.Ho@netapp.com>

nfsd: fix error handling of register_pernet_subsys() in init_nfsd()

init_nfsd() should not unregister pernet subsys if the register fails
but should instead unwind from the last successful operation which is
register_filesystem().

Unregistering a failed register_pernet_subsys() call can result in
a kernel GPF as revealed by programmatically injecting an error in
register_pernet_subsys().

Verified the fix handled failure gracefully with no lingering nfsd
entry in /proc/filesystems. This change was introduced by the commit
bd5ae9288d64 ("nfsd: register pernet ops last, unregister first"),
the original error handling logic was correct.

Fixes: bd5ae9288d64 ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Patrick Ho <Patrick.Ho@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

4227f811 29-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix transform header validation

Validate that the transform and smb request headers are present
before checking OriginalMessageSize and SessionId fields.

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

8f77150c 24-Sep-2021 Hyunchul Lee <hyc.lee@gmail.com>

ksmbd: add buffer validation for SMB2_CREATE_CONTEXT

Add buffer validation for SMB2_CREATE_CONTEXT.

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

442ff9eb 29-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add validation in smb2 negotiate

This patch add validation to check request buffer check in smb2
negotiate and fix null pointer deferencing oops in smb3_preauth_hash_rsp()
that found from manual test.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

9496e268 29-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add request buffer validation in smb2_set_info

Add buffer validation in smb2_set_info, and remove unused variable
in set_file_basic_info. and smb2_set_info infolevel functions take
structure pointer argument.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

88d30052 29-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: use correct basic info level in set_file_basic_info()

Use correct basic info level in set/get_file_basic_info().

Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

ce812992 27-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: remove NTLMv1 authentication

Remove insecure NTLMv1 authentication.

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

1018bf24 28-Sep-2021 Enzo Matsumiya <ematsumiya@suse.de>

ksmbd: fix documentation for 2 functions

ksmbd_kthread_fn() and create_socket() returns 0 or error code, and not
task_struct/ERR_PTR.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

df38d852 28-Sep-2021 Hou Tao <houtao1@huawei.com>

kernfs: also call kernfs_set_rev() for positive dentry

A KMSAN warning is reported by Alexander Potapenko:

BUG: KMSAN: uninit-value in kernfs_dop_revalidate+0x61f/0x840
fs/kernfs/dir.c:1053
kernfs_dop_revalidate+0x61f/0x840 fs/kernfs/dir.c:1053
d_revalidate fs/namei.c:854
lookup_dcache fs/namei.c:1522
__lookup_hash+0x3a6/0x590 fs/namei.c:1543
filename_create+0x312/0x7c0 fs/namei.c:3657
do_mkdirat+0x103/0x930 fs/namei.c:3900
__do_sys_mkdir fs/namei.c:3931
__se_sys_mkdir fs/namei.c:3929
__x64_sys_mkdir+0xda/0x120 fs/namei.c:3929
do_syscall_x64 arch/x86/entry/common.c:51

It seems a positive dentry in kernfs becomes a negative dentry directly
through d_delete() in vfs_rmdir(). dentry->d_time is uninitialized
when accessing it in kernfs_dop_revalidate(), because it is only
initialized when created as negative dentry in kernfs_iop_lookup().

The problem can be reproduced by the following command:

cd /sys/fs/cgroup/pids && mkdir hi && stat hi && rmdir hi && stat hi

A simple fixes seems to be initializing d->d_time for positive dentry
in kernfs_iop_lookup() as well. The downside is the negative dentry
will be revalidated again after it becomes negative in d_delete(),
because the revison of its parent must have been increased due to
its removal.

Alternative solution is implement .d_iput for kernfs, and assign d_time
for the newly-generated negative dentry in it. But we may need to
take kernfs_rwsem to protect again the concurrent kernfs_link_sibling()
on the parent directory, it is a little over-killing. Now the simple
fix is chosen.

Link: https://marc.info/?l=linux-fsdevel&m=163249838610499
Fixes: c7e7c04274b1 ("kernfs: use VFS negative dentry caching")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20210928140750.1274441-1-houtao1@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

6fd3ec5c 28-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity fix from Eric Biggers:
"Fix an integer overflow when computing the Merkle tree layout of
extremely large files, exposed by btrfs adding support for fs-verity"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: fix signed integer overflow with i_size near S64_MAX


1dc1eed4 27-Sep-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: fix IOCB_DIRECT if underlying fs doesn't support direct IO

Normally the check at open time suffices, but e.g loop device does set
IOCB_DIRECT after doing its own checks (which are not sufficent for
overlayfs).

Make sure we don't call the underlying filesystem read/write method with
the IOCB_DIRECT if it's not supported.

Reported-by: Huang Jianan <huangjianan@oppo.com>
Fixes: 16914e6fc7e1 ("ovl: add ovl_read_iter()")
Cc: <stable@vger.kernel.org> # v4.19
Tested-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

9b3b353e 27-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

vboxfs: fix broken legacy mount signature checking

Commit 9d682ea6bcc7 ("vboxsf: Fix the check for the old binary
mount-arguments struct") was meant to fix a build error due to sign
mismatch in 'char' and the use of character constants, but it just moved
the error elsewhere, in that on some architectures characters and signed
and on others they are unsigned, and that's just how the C standard
works.

The proper fix is a simple "don't do that then". The code was just
being silly and odd, and it should never have cared about signed vs
unsigned characters in the first place, since what it is testing is not
four "characters", but four bytes.

And the way to compare four bytes is by using "memcmp()".

Which compilers will know to just turn into a single 32-bit compare with
a constant, as long as you don't have crazy debug options enabled.

Link: https://lore.kernel.org/lkml/20210927094123.576521-1-arnd@kernel.org/
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

78f8876c 27-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: exclusively gate signal based exit on get_signal() return

io-wq threads block all signals, except SIGKILL and SIGSTOP. We should not
need any extra checking of signal_pending or fatal_signal_pending, rely
exclusively on whether or not get_signal() tells us to exit.

The original debugging of this issue led to the false positive that we
were exiting on non-fatal signals, but that is not the case. The issue
was around races with nr_workers accounting.

Fixes: 87c169665578 ("io-wq: ensure we exit if thread group is exiting")
Fixes: 15e20db2e0ce ("io-wq: only exit on fatal signals")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d72a9c15 23-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix invalid request buffer access in compound

Ronnie reported invalid request buffer access in chained command when
inserting garbage value to NextCommand of compound request.
This patch add validation check to avoid this issue.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Tested-by: Steve French <smfrench@gmail.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

18d46769 20-Sep-2021 Ronnie Sahlberg <lsahlber@redhat.com>

ksmbd: remove RFC1002 check in smb2 request

In smb_common.c you have this function : ksmbd_smb_request() which
is called from connection.c once you have read the initial 4 bytes for
the next length+smb2 blob.

It checks the first byte of this 4 byte preamble for valid values,
i.e. a NETBIOSoverTCP SESSION_MESSAGE or a SESSION_KEEP_ALIVE.

We don't need to check this for ksmbd since it only implements SMB2
over TCP port 445.
The netbios stuff was only used in very old servers when SMB ran over
TCP port 139.
Now that we run over TCP port 445, this is actually not a NB header anymore
and you can just treat it as a 4 byte length field that must be less
than 16Mbyte. and remove the references to the RFC1002 constants that no
longer applies.

Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

5e5d7597 26-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:
"Five fixes for the ksmbd kernel server, including three security
fixes:

- remove follow symlinks support

- use LOOKUP_BENEATH to prevent out of share access

- SMB3 compounding security fix

- fix for returning the default streams correctly, fixing a bug when
writing ppt or doc files from some clients

- logging more clearly that ksmbd is experimental (at module load
time)"

* tag '5.15-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: use LOOKUP_BENEATH to prevent the out of share access
ksmbd: remove follow symlinks support
ksmbd: check protocol id in ksmbd_verify_smb_message()
ksmbd: add default data stream name in FILE_STREAM_INFORMATION
ksmbd: log that server is experimental at module load


a3b397b4 25-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"16 patches.

Subsystems affected by this patch series: xtensa, sh, ocfs2, scripts,
lib, and mm (memory-failure, kasan, damon, shmem, tools, pagecache,
debug, and pagemap)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: fix uninitialized use in overcommit_policy_handler
mm/memory_failure: fix the missing pte_unmap() call
kasan: always respect CONFIG_KASAN_STACK
sh: pgtable-3level: fix cast to pointer from integer of different size
mm/debug: sync up latest migrate_reason to migrate_reason_names
mm/debug: sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN
mm: fs: invalidate bh_lrus for only cold path
lib/zlib_inflate/inffast: check config in C to avoid unused function warning
tools/vm/page-types: remove dependency on opt_file for idle page tracking
scripts/sorttable: riscv: fix undeclared identifier 'EM_RISCV' error
ocfs2: drop acl cache for directories too
mm/shmem.c: fix judgment error in shmem_is_huge()
xtensa: increase size of gcc stack frame check
mm/damon: don't use strnlen() with known-bogus source length
kasan: fix Kconfig check of CC_HAS_WORKING_NOSANITIZE_ADDRESS
mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()


f6f360ae 25-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-09-25' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"This one looks a bit bigger than it is, but that's mainly because 2/3
of it is enabling IORING_OP_CLOSE to close direct file descriptors.

We've had a few folks using them and finding it confusing that the way
to close them is through using -1 for file update, this just brings
API symmetry for direct descriptors. Hence I think we should just do
this now and have a better API for 5.15 release. There's some room for
de-duplicating the close code, but we're leaving that for the next
merge window.

Outside of that, just small fixes:

- Poll race fixes (Hao)

- io-wq core dump exit fix (me)

- Reschedule around potentially intensive tctx and buffer iterators
on teardown (me)

- Fix for always ending up punting files update to io-wq (me)

- Put the provided buffer meta data under memcg accounting (me)

- Tweak for io_write(), removing dead code that was added with the
iterator changes in this release (Pavel)"

* tag 'io_uring-5.15-2021-09-25' of git://git.kernel.dk/linux-block:
io_uring: make OP_CLOSE consistent with direct open
io_uring: kill extra checks in io_write()
io_uring: don't punt files update to io-wq unconditionally
io_uring: put provided buffer meta data under memcg accounting
io_uring: allow conditional reschedule for intensive iterators
io_uring: fix potential req refcount underflow
io_uring: fix missing set of EPOLLONESHOT for CQ ring overflow
io_uring: fix race between poll completion and cancel_hash insertion
io-wq: ensure we exit if thread group is exiting


a5e0aceab 25-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
"Two bugfixes to fix the 4KiB blockmap chunk format availability and a
dangling pointer usage. There is also a trivial cleanup to clarify
compacted_2b if compacted_4b_initial > totalidx.

Summary:

- fix the dangling pointer use in erofs_lookup tracepoint

- fix unsupported chunk format check

- zero out compacted_2b if compacted_4b_initial > totalidx"

* tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clear compacted_2b if compacted_4b_initial > totalidx
erofs: fix misbehavior of unsupported chunk format check
erofs: fix up erofs_lookup tracepoint


b8f42965 25-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Six small cifs/smb3 fixes, two for stable:

- important fix for deferred close (found by a git functional test)
related to attribute caching on close.

- four (two cosmetic, two more serious) small fixes for problems
pointed out by smatch via Dan Carpenter

- fix for comment formatting problems pointed out by W=1"

* tag '5.15-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix incorrect check for null pointer in header_assemble
smb3: correct server pointer dereferencing check to be more consistent
smb3: correct smb3 ACL security descriptor
cifs: Clear modified attribute bit from inode flags
cifs: Deal with some warnings from W=1
cifs: fix a sign extension bug


265fd199 24-Sep-2021 Hyunchul Lee <hyc.lee@gmail.com>

ksmbd: use LOOKUP_BENEATH to prevent the out of share access

instead of removing '..' in a given path, call
kern_path with LOOKUP_BENEATH flag to prevent
the out of share access.

ran various test on this:
smb2-cat-async smb://127.0.0.1/homes/../out_of_share
smb2-cat-async smb://127.0.0.1/homes/foo/../../out_of_share
smbclient //127.0.0.1/homes -c "mkdir ../foo2"
smbclient //127.0.0.1/homes -c "rename bar ../bar"

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Boehme <slow@samba.org>
Tested-by: Steve French <smfrench@gmail.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

243418e3 24-Sep-2021 Minchan Kim <minchan@kernel.org>

mm: fs: invalidate bh_lrus for only cold path

The kernel test robot reported the regression of fio.write_iops[1] with
commit 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration").

Since lru_add_drain is called frequently, invalidate bh_lrus there could
increase bh_lrus cache miss ratio, which needs more IO in the end.

This patch moves the bh_lrus invalidation from the hot path( e.g.,
zap_page_range, pagevec_release) to cold path(i.e., lru_add_drain_all,
lru_cache_disable).

Zhengjun Xing confirmed
"I test the patch, the regression reduced to -2.9%"

[1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/
[2] 8cc621d2f45d, mm: fs: invalidate BH LRU during page migration

Link: https://lkml.kernel.org/r/20210907212347.1977686-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Tested-by: "Xing, Zhengjun" <zhengjun.xing@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9c0f0a03 24-Sep-2021 Wengang Wang <wen.gang.wang@oracle.com>

ocfs2: drop acl cache for directories too

ocfs2_data_convert_worker() is currently dropping any cached acl info
for FILE before down-converting meta lock. It should also drop for
DIRECTORY. Otherwise the second acl lookup returns the cached one (from
VFS layer) which could be already stale.

The problem we are seeing is that the acl changes on one node doesn't
get refreshed on other nodes in the following case:

Node 1 Node 2
-------------- ----------------
getfacl dir1

getfacl dir1 <-- this is OK

setfacl -m u:user1:rwX dir1
getfacl dir1 <-- see the change for user1

getfacl dir1 <-- can't see change for user1

Link: https://lkml.kernel.org/r/20210903012631.6099-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7df778be 24-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: make OP_CLOSE consistent with direct open

From recently open/accept are now able to manipulate fixed file table,
but it's inconsistent that close can't. Close the gap, keep API same as
with open/accept, i.e. via sqe->file_slot.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a295aef6 23-Sep-2021 Zheng Liang <zhengliang6@huawei.com>

ovl: fix missing negative dentry check in ovl_rename()

The following reproducer

mkdir lower upper work merge
touch lower/old
touch lower/new
mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge
rm merge/new
mv merge/old merge/new & unlink upper/new

may result in this race:

PROCESS A:
rename("merge/old", "merge/new");
overwrite=true,ovl_lower_positive(old)=true,
ovl_dentry_is_whiteout(new)=true -> flags |= RENAME_EXCHANGE

PROCESS B:
unlink("upper/new");

PROCESS A:
lookup newdentry in new_upperdir
call vfs_rename() with negative newdentry and RENAME_EXCHANGE

Fix by adding the missing check for negative newdentry.

Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Fixes: e9be9d5e76e3 ("overlay filesystem")
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

4c4f0c2b 24-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.15-rc3' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
"A fix for a potential array out of bounds access from Dan"

* tag 'ceph-for-5.15-rc3' of git://github.com/ceph/ceph-client:
ceph: fix off by one bugs in unsafe_request_wait()


e655c81a 24-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fixes_for_v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc filesystem fixes from Jan Kara:
"A for ext2 sleep in atomic context in case of some fs problems and a
cleanup of an invalidate_lock initialization"

* tag 'fixes_for_v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix sleeping in atomic bugs on error
mm: Fully initialize invalidate_lock, amend lock class later


9f3a2cb2 24-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill extra checks in io_write()

We don't retry short writes and so we would never get to async setup in
io_write() in that case. Thus ret2 > 0 is always false and
iov_iter_advance() is never used. Apparently, the same is found by
Coverity, which complains on the code.

Fixes: cd65869512ab ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cdb31c29 24-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: don't punt files update to io-wq unconditionally

There's no reason to punt it unconditionally, we just need to ensure that
the submit lock grabbing is conditional.

Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9990da93 24-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: put provided buffer meta data under memcg accounting

For each provided buffer, we allocate a struct io_buffer to hold the
data associated with it. As a large number of buffers can be provided,
account that data with memcg.

Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8bab4c09 24-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: allow conditional reschedule for intensive iterators

If we have a lot of threads and rings, the tctx list can get quite big.
This is especially true if we keep creating new threads and rings.
Likewise for the provided buffers list. Be nice and insert a conditional
reschedule point while iterating the nodes for deletion.

Link: https://lore.kernel.org/io-uring/00000000000064b6b405ccb41113@google.com/
Reported-by: syzbot+111d2a03f51f5ae73775@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5b7aa38d 22-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix potential req refcount underflow

For multishot mode, there may be cases like:

iowq original context
io_poll_add
_arm_poll()
mask = vfs_poll() is not 0
if mask
(2) io_poll_complete()
compl_unlock
(interruption happens
tw queued to original
context)
io_poll_task_func()
compl_lock
(3) done = io_poll_complete() is true
compl_unlock
put req ref
(1) if (poll->flags & EPOLLONESHOT)
put req ref

EPOLLONESHOT flag in (1) may be from (2) or (3), so there are multiple
combinations that can cause ref underfow.
Let's address it by:
- check the return value in (2) as done
- change (1) to if (done)
in this way, we only do ref put in (1) if 'oneshot flag' is from
(2)
- do poll.done check in io_poll_task_func(), so that we won't put ref
for the second time.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a62682f9 22-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix missing set of EPOLLONESHOT for CQ ring overflow

We should set EPOLLONESHOT if cqring_fill_event() returns false since
io_poll_add() decides to put req or not by it.

Fixes: 5082620fb2ca ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bd99c71b 22-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix race between poll completion and cancel_hash insertion

If poll arming and poll completion runs in parallel, there maybe races.
For instance, run io_poll_add in iowq and io_poll_task_func in original
context, then:

iowq original context
io_poll_add
vfs_poll
(interruption happens
tw queued to original
context) io_poll_task_func
generate cqe
del from cancel_hash[]
if !poll.done
insert to cancel_hash[]

The entry left in cancel_hash[], similar case for fast poll.
Fix it by set poll.done = true when del from cancel_hash[].

Fixes: 5082620fb2ca ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87c16966 21-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: ensure we exit if thread group is exiting

Dave reports that a coredumping workload gets stuck in 5.15-rc2, and
identified the culprit in the Fixes line below. The problem is that
relying solely on fatal_signal_pending() to gate whether to exit or not
fails miserably if a process gets eg SIGILL sent. Don't exclusively
rely on fatal signals, also check if the thread group is exiting.

Fixes: 15e20db2e0ce ("io-wq: only exit on fatal signals")
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

66019837 22-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactoring lock in ntfs_init_acl

This is possible because of moving lock into ntfs_create_inode.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

ba77237e 22-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Change posix_acl_equiv_mode to posix_acl_update_mode

Right now ntfs3 uses posix_acl_equiv_mode instead of
posix_acl_update_mode like all other fs.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

398c35f4 22-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Pass flags to ntfs_set_ea in ntfs_set_acl_ex

In case of removing of xattr there must be XATTR_REPLACE flag and
zero length. We already check XATTR_REPLACE in ntfs_set_ea, so
now we pass XATTR_REPLACE to ntfs_set_ea.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0bd5fdb8 22-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Refactor ntfs_get_acl_ex for better readability

We can safely move set_cached_acl because it works with NULL acl too.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d562e901 23-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Move ni_lock_dir and ni_unlock into ntfs_create_inode

Now ntfs3 locks mutex for smaller time.
Theoretically in successful cases those locks aren't needed at all.
But proving the same for error cases is difficult.
So instead of removing them we just move them.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

6c1ee4d3 23-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Fix logical error in ntfs_create_inode

We need to always call indx_delete_entry after indx_insert_entry
if error occurred.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9ed38fd4 23-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: fix incorrect check for null pointer in header_assemble

Although very unlikely that the tlink pointer would be null in this case,
get_next_mid function can in theory return null (but not an error)
so need to check for null (not for IS_ERR, which can not be returned
here).

Address warning:

fs/smbfs_client/connect.c:2392 cifs_match_super()
warn: 'tlink' isn't an ERR_PTR

Pointed out by Dan Carpenter via smatch code analysis tool

CC: stable@vger.kernel.org
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

1db1aa98 23-Sep-2021 Steve French <stfrench@microsoft.com>

smb3: correct server pointer dereferencing check to be more consistent

Address warning:

fs/smbfs_client/misc.c:273 header_assemble()
warn: variable dereferenced before check 'treeCon->ses->server'

Pointed out by Dan Carpenter via smatch code analysis tool

Although the check is likely unneeded, adding it makes the code
more consistent and easier to read, as the same check is
done elsewhere in the function.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

f9e36107 23-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- regression fix for leak of transaction handle after verity rollback
failure

- properly reset device last error between mounts

- improve one error handling case when checksumming bios

- fixup confusing displayed size of space info free space

* tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent __btrfs_dump_space_info() to underflow its free space
btrfs: fix mount failure due to past and transient device flush error
btrfs: fix transaction handle leak after verity rollback failure
btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling


b06d893e 23-Sep-2021 Steve French <stfrench@microsoft.com>

smb3: correct smb3 ACL security descriptor

Address warning:

fs/smbfs_client/smb2pdu.c:2425 create_sd_buf()
warn: struct type mismatch 'smb3_acl vs cifs_acl'

Pointed out by Dan Carpenter via smatch code analysis tool

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

4f222622 22-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: Clear modified attribute bit from inode flags

Clear CIFS_INO_MODIFIED_ATTR bit from inode flags after
updating mtime and ctime

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Steve French <stfrench@microsoft.com>

03ab9cb9 20-Sep-2021 David Howells <dhowells@redhat.com>

cifs: Deal with some warnings from W=1

Deal with some warnings generated from make W=1:

(1) Add/remove/fix kerneldoc parameters descriptions.

(2) Turn cifs' rqst_page_get_length()'s banner comment into a kerneldoc
comment. It should probably be prefixed with "cifs_" though.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

82cb8753 21-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove deprecated mount options nls

Some discussion has been spoken that this deprecated mount options
should be removed before 5.15 lands. This driver is not never seen day
light so it was decided that nls mount option has to be removed. We have
always possibility to add this if needed.

One possible need is example if current ntfs driver will be taken out of
kernel and ntfs3 needs to support mount options what it has.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

808bc0a8 18-Sep-2021 Christophe JAILLET <christophe.jaillet@wanadoo.fr>

fs/ntfs3: Remove a useless shadowing variable

There is already a 'u8 mask' defined at the top of the function.
There is no need to define a new one here.

Remove the useless and shadowing new 'mask' variable.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d2846bf3 18-Sep-2021 Christophe JAILLET <christophe.jaillet@wanadoo.fr>

fs/ntfs3: Remove a useless test in 'indx_find()'

'fnd' has been dereferenced several time before, so testing it here is
pointless.
Moreover, all callers of 'indx_find()' already have some error handling
code that makes sure that no NULL 'fnd' is passed.

So, remove the useless test.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

c40dd3ca 13-Sep-2021 Yue Hu <huyue2@yulong.com>

erofs: clear compacted_2b if compacted_4b_initial > totalidx

Currently, the whole indexes will only be compacted 4B if
compacted_4b_initial > totalidx. So, the calculated compacted_2b
is worthless for that case. It may waste CPU resources.

No need to update compacted_4b_initial as mkfs since it's used to
fulfill the alignment of the 1st compacted_2b pack and would handle
the case above.

We also need to clarify compacted_4b_end here. It's used for the
last lclusters which aren't fitted in the previous compacted_2b
packs.

Some messages are from Xiang.

Link: https://lore.kernel.org/r/20210914035915.1190-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[ Gao Xiang: it's enough to use "compacted_4b_initial < totalidx". ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

d705117d 22-Sep-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: fix misbehavior of unsupported chunk format check

Unsupported chunk format should be checked with
"if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL)"

Found when checking with 4k-byte blockmap (although currently mkfs
uses inode chunk indexes format by default.)

Link: https://lore.kernel.org/r/20210922095141.233938-1-hsiangkao@linux.alibaba.com
Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

4ea47798 20-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: remove follow symlinks support

Use LOOKUP_NO_SYMLINKS flags for default lookup to prohibit the middle of
symlink component lookup and remove follow symlinks parameter support.
We re-implement it as reparse point later.

Test result:
smbclient -Ulinkinjeon%1234 //172.30.1.42/share -c
"get hacked/passwd passwd"
NT_STATUS_OBJECT_NAME_NOT_FOUND opening remote file \hacked\passwd

Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

18a015bc 22-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: check protocol id in ksmbd_verify_smb_message()

When second smb2 pdu has invalid protocol id, ksmbd doesn't detect it
and allow to process smb2 request. This patch add the check it in
ksmbd_verify_smb_message() and don't use protocol id of smb2 request as
protocol id of response.

Reviewed-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Ralph Böhme <slow@samba.org>
Reported-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

80f6e308 16-Sep-2021 Eric Biggers <ebiggers@google.com>

fs-verity: fix signed integer overflow with i_size near S64_MAX

If the file size is almost S64_MAX, the calculated number of Merkle tree
levels exceeds FS_VERITY_MAX_LEVELS, causing FS_IOC_ENABLE_VERITY to
fail. This is unintentional, since as the comment above the definition
of FS_VERITY_MAX_LEVELS states, it is enough for over U64_MAX bytes of
data using SHA-256 and 4K blocks. (Specifically, 4096*128**8 >= 2**64.)

The bug is actually that when the number of blocks in the first level is
calculated from i_size, there is a signed integer overflow due to i_size
being signed. Fix this by treating i_size as unsigned.

This was found by the new test "generic: test fs-verity EFBIG scenarios"
(https://lkml.kernel.org/r/b1d116cd4d0ea74b9cd86f349c672021e005a75c.1631558495.git.boris@bur.io).

This didn't affect ext4 or f2fs since those have a smaller maximum file
size, but it did affect btrfs which allows files up to S64_MAX bytes.

Reported-by: Boris Burkov <boris@bur.io>
Fixes: 3fda4c617e84 ("fs-verity: implement FS_IOC_ENABLE_VERITY ioctl")
Fixes: fd2d1acfcadf ("fs-verity: add the hook for file ->open()")
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Boris Burkov <boris@bur.io>
Link: https://lore.kernel.org/r/20210916203424.113376-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>

cf1d2c3e 22-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
"Critical bug fixes:

- Fix crash in NLM TEST procedure

- NFSv4.1+ backchannel not restored after PATH_DOWN"

* tag 'nfsd-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: back channel stuck in SEQ4_STATUS_CB_PATH_DOWN
NLM: Fix svcxdr_encode_owner()


372d1f3e 21-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

ext2: fix sleeping in atomic bugs on error

The ext2_error() function syncs the filesystem so it sleeps. The caller
is holding a spinlock so it's not allowed to sleep.

ext2_statfs() <- disables preempt
-> ext2_count_free_blocks()
-> ext2_get_group_desc()

Fix this by using WARN() to print an error message and a stack trace
instead of using ext2_error().

Link: https://lore.kernel.org/r/20210921203233.GA16529@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>

e946d3c8 21-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

cifs: fix a sign extension bug

The problem is the mismatched types between "ctx->total_len" which is
an unsigned int, "rc" which is an int, and "ctx->rc" which is a
ssize_t. The code does:

ctx->rc = (rc == 0) ? ctx->total_len : rc;

We want "ctx->rc" to store the negative "rc" error code. But what
happens is that "rc" is type promoted to a high unsigned int and
'ctx->rc" will store the high positive value instead of a negative
value.

The fix is to change "rc" from an int to a ssize_t.

Fixes: c610c4b619e5 ("CIFS: Add asynchronous write support through kernel AIO")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

9f632331 18-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add default data stream name in FILE_STREAM_INFORMATION

Windows client expect to get default stream name(::DATA) in
FILE_STREAM_INFORMATION response even if there is no stream data in file.
This patch fix update failure when writing ppt or doc files.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

e44fd508 20-Sep-2021 Steve French <stfrench@microsoft.com>

ksmbd: log that server is experimental at module load

While we are working through detailed security reviews
of ksmbd server code we should remind users that it is an
experimental module by adding a warning when the module
loads. Currently the module shows as experimental
in Kconfig and is disabled by default, but we don't want
to confuse users.

Although ksmbd passes a wide variety of the
important functional tests (since initial focus had
been largely on functional testing such as smbtorture,
xfstests etc.), and ksmbd has added key security
features (e.g. GCM256 encryption, Kerberos support),
there are ongoing detailed reviews of the code base
for path processing and network buffer decoding, and
this patch reminds users that the module should be
considered "experimental."

Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

708c8716 05-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

ceph: fix off by one bugs in unsafe_request_wait()

The "> max" tests should be ">= max" to prevent an out of bounds access
on the next lines.

Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

63544672 09-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Add sync flag to ntfs_sb_write_run and al_update

This allows to wait only when it's requested.
It speeds up creation of hardlinks.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

56eaeb10 09-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Change max hardlinks limit to 4000

xfstest generic/041 works with 3003 hardlinks.
Because of this we raise hardlinks limit to 4000.
There are no drawbacks or regressions.
Theoretically we can raise all the way up to ffff,
but there is no practical use for this.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d5f65459 20-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

qnx4: work around gcc false positive warning bug

In commit b7213ffa0e58 ("qnx4: avoid stringop-overread errors") I tried
to teach gcc about how the directory entry structure can be two
different things depending on a status flag. It made the code clearer,
and it seemed to make gcc happy.

However, Arnd points to a gcc bug, where despite using two different
members of a union, gcc then gets confused, and uses the size of one of
the members to decide if a string overrun happens. And not necessarily
the rigth one.

End result: with some configurations, gcc-11 will still complain about
the source buffer size being overread:

fs/qnx4/dir.c: In function 'qnx4_readdir':
fs/qnx4/dir.c:76:32: error: 'strnlen' specified bound [16, 48] exceeds source size 1 [-Werror=stringop-overread]
76 | size = strnlen(name, size);
| ^~~~~~~~~~~~~~~~~~~
fs/qnx4/dir.c:26:22: note: source object declared here
26 | char de_name;
| ^~~~~~~

because gcc will get confused about which union member entry is actually
getting accessed, even when the source code is very clear about it. Gcc
internally will have combined two "redundant" pointers (pointing to
different union elements that are at the same offset), and takes the
size checking from one or the other - not necessarily the right one.

This is clearly a gcc bug, but we can work around it fairly easily. The
biggest thing here is the big honking comment about why we do what we
do.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578#c6
Reported-and-tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ee9d4810 08-Sep-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Fix insertion of attr in ni_ins_attr_ext

Do not try to insert attribute if there is no room in record.

Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

af505cad 01-Sep-2021 Nirmoy Das <nirmoy.das@amd.com>

debugfs: debugfs_create_file_size(): use IS_ERR to check for error

debugfs_create_file() returns encoded error so use IS_ERR for checking
return value.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
Cc: stable <stable@vger.kernel.org>
References: https://gitlab.freedesktop.org/drm/amd/-/issues/1686
Link: https://lore.kernel.org/r/20210902102917.2233-1-nirmoy.das@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d9fb6784 20-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
"Fixes for AFS problems that can cause data corruption due to
interaction with another client modifying data cached locally:

- When d_revalidating a dentry, don't look at the inode to which it
points. Only check the directory to which the dentry belongs. This
was confusing things and causing the silly-rename cleanup code to
remove the file now at the dentry of a file that got deleted.

- Fix mmap data coherency. When a callback break is received that
relates to a file that we have cached, the data content may have
been changed (there are other reasons, such as the user's rights
having been changed). However, we're checking it lazily, only on
entry to the kernel, which doesn't happen if we have a writeable
shared mapped page on that file.

We make the kernel keep track of mmapped files and clear all PTEs
mapping to that file as soon as the callback comes in by calling
unmap_mapping_pages() (we don't necessarily want to zap the
pagecache). This causes the kernel to be reentered when userspace
tries to access the mmapped address range again - and at that point
we can query the server and, if we need to, zap the page cache.

Ideally, I would check each file at the point of notification, but
that involves poking the server[*] - which is holding an exclusive
lock on the vnode it is changing, waiting for all the clients it
notified to reply. This could then deadlock against the server.
Further, invalidating the pagecache might call ->launder_page(),
which would try to write to the file, which would definitely
deadlock. (AFS doesn't lease file access).

[*] Checking to see if the file content has changed is a matter of
comparing the current data version number, but we have to ask
the server for that. We also need to get a new callback promise
and we need to poke the server for that too.

- Add some more points at which the inode is validated, since we're
doing it lazily, notably in ->read_iter() and ->page_mkwrite(), but
also when performing some directory operations.

Ideally, checking in ->read_iter() would be done in some derivation
of filemap_read(). If we're going to call the server to read the
file, then we get the file status fetch as part of that.

- The above is now causing us to make a lot more calls to
afs_validate() to check the inode - and afs_validate() takes the
RCU read lock each time to make a quick check (ie.
afs_check_validity()). This is entirely for the purpose of checking
cb_s_break to see if the server we're using reinitialised its list
of callbacks - however this isn't a very common event, so most of
the time we're taking this needlessly.

Add a new cell-wide counter to count the number of
reinitialisations done by any server and check that - and only if
that changes, take the RCU read lock and check the server list (the
server list may change, but the cell a file is part of won't).

- Don't update vnode->cb_s_break and ->cb_v_break inside the validity
checking loop. The cb_lock is done with read_seqretry, so we might
go round the loop a second time after resetting those values - and
that could cause someone else checking validity to miss something
(I think).

Also included are patches for fixes for some bugs encountered whilst
debugging this:

- Fix a leak of afs_read objects and fix a leak of keys hidden by
that.

- Fix a leak of pages that couldn't be added to extend a writeback.

- Fix the maintenance of i_blocks when i_size is changed by a local
write or a local dir edit"

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217 [1]
Link: https://lore.kernel.org/r/163111665183.283156.17200205573146438918.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/ # i_blocks patch

* tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix updating of i_blocks on file/dir extension
afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server
afs: Try to avoid taking RCU read lock when checking vnode validity
afs: Fix mmap coherency vs 3rd-party changes
afs: Fix incorrect triggering of sillyrename on 3rd-party invalidation
afs: Add missing vnode validation checks
afs: Fix page leak
afs: Fix missing put on afs_read objects and missing get on the key therein


707a63e9 20-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:
"Three ksmbd fixes, including an important security fix for path
processing, and a buffer overflow check, and a trivial fix for
incorrect header inclusion"

* tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbd:
ksmbd: add validation for FILE_FULL_EA_INFORMATION of smb2_get_info
ksmbd: prevent out of share access
ksmbd: transport_rdma: Don't include rwlock.h directly


fdf50784 20-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:

- two deferred close fixes (for bugs found with xfstests 478 and 461)

- a deferred close improvement in rename

- two trivial fixes for incorrect Linux comment formatting of multiple
cifs files (pointed out by automated kernel test robot and
checkpatch)

* tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Not to defer close on file when lock is set
cifs: Fix soft lockup during fsstress
cifs: Deferred close performance improvements
cifs: fix incorrect kernel doc comments
cifs: remove pathname for file from SPDX header


880301bb 10-Sep-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Fix a memory leak on object opts

Currently a failed allocation on sbi->upcase will cause an exit via
the label free_sbi causing a memory leak on object opts. Fix this by
re-ordering the exit paths free_opts and free_sbi so that kfree's occur
in the reverse allocation order.

Addresses-Coverity: ("Resource leak")
Fixes: 27fac77707a1 ("fs/ntfs3: Init spi more in init_fs_context than fill_super")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

28861e3b 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Initiliaze sb blocksize only in one place + refactor

Right now sb blocksize first get initiliazed in fill_super but in can be
changed in helper function. It makes more sense to that this happened
only in one place.

Because we move this to helper function it makes more sense that
s_maxbytes will also be there. I rather have every sb releted thing in
fill_super, but because there is already sb releted stuff in this
helper. This will have to do for now.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0e59a87e 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Initialize pointer before use place in fill_super

Initializing should be as close as possible when we use it so that
we do not need to scroll up to see what is happening.

Also bdev_get_queue() can never return NULL so we do not need to check
for !rq.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0056b273 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove tmp pointer upcase in fill_super

We can survive without this tmp point upcase. So remove it we don't have
so many tmp pointer in this function.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

4ea41b3e 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove tmp pointer bd_inode in fill_super

Drop tmp pointer bd_inode because this is only used ones in fill_super.
Also we have so many initializing happening at the beginning that it is
already way too much to follow.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0cde7e81 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove tmp var is_ro in ntfs_fill_super

We only use this in two places so we do not really need it. Also
wrapper sb_rdonly() is pretty self explanatory. This will make little
bit easier to read this super long variable list in the beginning of
ntfs_fill_super().

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b4f110d6 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use sb instead of sbi->sb in fill_super

Use sb instead of sbi->sb in fill_super. We have sb so why not use
it. Also makes code more readable.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

10b4f12c 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unnecessary variable loading in fill_super

Remove some unnecessary variable loading. These look like copy paste
work and they are not used to anything.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

bce1828f 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Return straight without goto in fill_super

In many places it is not needed to use goto out. We can just return
right away. This will make code little bit more cleaner as we won't
need to check error path.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

5d7d6b16 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove impossible fault condition in fill_super

Remove root drop when we fault out. This can never happened because
when we allocate root we eather fault when no root or success.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

7ea04817 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change EINVAL to ENOMEM when d_make_root fails

Change EINVAL to ENOMEM when d_make_root fails because that is right
errno.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0412016e 09-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix wrong error message $Logfile -> $UpCase

Fix wrong error message $Logfile -> $UpCase. Probably copy paste.

Fixes: 203c2b3a406a ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

6d56262c 18-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add validation for FILE_FULL_EA_INFORMATION of smb2_get_info

Add validation to check whether req->InputBufferLength is smaller than
smb2_ea_info_req structure size.

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

f58eae6c 17-Sep-2021 Hyunchul Lee <hyc.lee@gmail.com>

ksmbd: prevent out of share access

Because of .., files outside the share directory
could be accessed. To prevent this, normalize
the given path and remove all . and ..
components.

In addition to the usual large set of regression tests (smbtorture
and xfstests), ran various tests on this to specifically check
path name validation including libsmb2 tests to verify path
normalization:

./examples/smb2-ls-async smb://172.30.1.15/homes2/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/..bar/
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar..
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar../../../../

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

35866f3f 17-Sep-2021 Rohith Surabattula <rohiths@microsoft.com>

cifs: Not to defer close on file when lock is set

Close file immediately when lock is set.

Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

71826b06 17-Sep-2021 Rohith Surabattula <rohiths@microsoft.com>

cifs: Fix soft lockup during fsstress

Below traces are observed during fsstress and system got hung.
[ 130.698396] watchdog: BUG: soft lockup - CPU#6 stuck for 26s!

Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

e3fc0656 17-Sep-2021 Rohith Surabattula <rohiths@microsoft.com>

cifs: Deferred close performance improvements

During unlink/rename instead of closing all the deferred handles
under tcon, close only handles under the requested dentry.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

0619b790 16-Sep-2021 Qu Wenruo <wqu@suse.com>

btrfs: prevent __btrfs_dump_space_info() to underflow its free space

It's not uncommon where __btrfs_dump_space_info() gets called
under over-commit situations.

In that case free space would underflow as total allocated space is not
enough to handle all the over-committed space.

Such underflow values can sometimes cause confusion for users enabled
enospc_debug mount option, and takes some seconds for developers to
convert the underflow value to signed result.

Just output the free space as s64 to avoid such problem.

Reported-by: Eli V <eliventer@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUSy4zgyhf-4d9T+KdJp9w=UgzC2A0V=VtmaeEpcGgm1-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6b225baa 08-Sep-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix mount failure due to past and transient device flush error

When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equals
to the maximum tolerated device failures for a raid profile.

However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device is missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

$ mount /dev/sdc /mnt
mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

$ dmesg
(...)
[12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
[12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
[12918.890853] BTRFS error (device sdc): open_ctree failed

The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, is sees a non zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.

Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt. We
only need to track flush errors during the current mount, so that we never
commit a super block if such errors happened.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

acbee9af 08-Sep-2021 Filipe Manana <fdmanana@suse.com>

btrfs: fix transaction handle leak after verity rollback failure

During a verity rollback, if we fail to update the inode or delete the
orphan, we abort the transaction and return without releasing our
transaction handle. Fix that by releasing the handle.

Fixes: 146054090b0859 ("btrfs: initial fsverity support")
Fixes: 705242538ff348 ("btrfs: verity metadata orphan items")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bbc9a6eb 16-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling

There is a BUG_ON() in btrfs_csum_one_bio() to catch code logic error.
It has indeed caught several bugs during subpage development.
But the BUG_ON() itself will bring down the whole system which is
an overkill.

Replace it with a WARN() and exit gracefully, so that it won't crash the
whole system while we can still catch the code logic error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ddf21bd8 17-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block

Pull io_uring iov_iter retry fixes from Jens Axboe:
"This adds a helper to save/restore iov_iter state, and modifies
io_uring to use it.

After that is done, we can now kill the iter->truncated addition that
we added for this release. The io_uring change is being overly
cautious with the save/restore/advance, but better safe than sorry and
we can always improve that and reduce the overhead if it proves to be
of concern. The only case to be worried about in this regard is huge
IO, where iteration can take a while to iterate segments.

I spent some time writing test cases, and expanded the coverage quite
a bit from the last posting of this. liburing carries this regression
test case now:

https://git.kernel.dk/cgit/liburing/tree/test/file-verify.c

which exercises all of this. It now also supports provided buffers,
and explicitly tests for end-of-file/device truncation as well.

On top of that, Pavel sanitized the IOPOLL retry path to follow the
exact same pattern as normal IO"

* tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block:
io_uring: move iopoll reissue into regular IO path
Revert "iov_iter: track truncated size"
io_uring: use iov_iter state save/restore helpers
iov_iter: add helper to save iov_iter state


0bc7eb03 17-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-09-17' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"Mostly fixes for regressions in this cycle, but also a few fixes that
predate this release.

The odd one out is a tweak to the direct files added in this release,
where attempting to reuse a slot is allowed instead of needing an
explicit removal of that slot first. It's a considerable improvement
in usability to that API, hence I'm sending it for -rc2.

- io-wq race fix and cleanup (Hao)

- loop_rw_iter() type fix

- SQPOLL max worker race fix

- Allow poll arm for O_NONBLOCK files, fixing a case where it's
impossible to properly use io_uring if you cannot modify the file
flags

- Allow direct open to simply reuse a slot, instead of needing it
explicitly removed first (Pavel)

- Fix a case where we missed signal mask restoring in cqring_wait, if
we hit -EFAULT (Xiaoguang)"

* tag 'io_uring-5.15-2021-09-17' of git://git.kernel.dk/linux-block:
io_uring: allow retry for O_NONBLOCK if async is supported
io_uring: auto-removal for direct open/accept
io_uring: fix missing sigmask restore in io_cqring_wait()
io_uring: pin SQPOLL data before unlocking ring lock
io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items
io-wq: fix potential race of acct->nr_workers
io-wq: code clean of io_wqe_create_worker()
io_uring: ensure symmetry in handling iter types in loop_rw_iter()


02579b2f 16-Sep-2021 Dai Ngo <dai.ngo@oracle.com>

nfsd: back channel stuck in SEQ4_STATUS_CB_PATH_DOWN

When the back channel enters SEQ4_STATUS_CB_PATH_DOWN state, the client
recovers by sending BIND_CONN_TO_SESSION but the server fails to recover
the back channel and leaves it as NFSD4_CB_DOWN.

Fix by enhancing nfsd4_bind_conn_to_session to probe the back channel
by calling nfsd4_probe_callback.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

89c485c7 16-Sep-2021 Chuck Lever <chuck.lever@oracle.com>

NLM: Fix svcxdr_encode_owner()

Dai Ngo reports that, since the XDR overhaul, the NLM server crashes
when the TEST procedure wants to return NLM_DENIED. There is a bug
in svcxdr_encode_owner() that none of our standard test cases found.

Replace the open-coded function with a call to an appropriate
pre-fabricated XDR helper.

Reported-by: Dai Ngo <Dai.Ngo@oracle.com>
Fixes: a6a63ca5652e ("lockd: Common NLM XDR helpers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a9b3043d 11-Sep-2021 Mike Galbraith <efault@gmx.de>

ksmbd: transport_rdma: Don't include rwlock.h directly

rwlock.h specifically asks to not be included directly.

In fact, the proper spinlock.h include isn't needed either,
it comes with the huge pile that kthread.h ends up pulling
in, so just drop it entirely.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

23ca067b 01-Sep-2021 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mm: Fully initialize invalidate_lock, amend lock class later

The function __init_rwsem() is not part of the official API, it just a helper
function used by init_rwsem().
Changing the lock's class and name should be done by using
lockdep_set_class_and_name() after the has been fully initialized. The overhead
of the additional class struct and setting it twice is negligible and it works
across all locks.

Fully initialize the lock with init_rwsem() and then set the custom class and
name for the lock.

Fixes: 730633f0b7f95 ("mm: Protect operations adding pages to page cache with invalidate_lock")
Link: https://lore.kernel.org/r/20210901084403.g4fezi23cixemlhh@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jan Kara <jack@suse.cz>

6e3331ee 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use min/max macros instated of ternary operators

We can make code little bit more readable by using min/max macros.

These were found with Coccinelle.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b5322eb1 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use clamp/max macros instead of comparisons

We can make code little more readable by using kernel macros clamp/max.

This were found with kernel included Coccinelle minmax script.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

f162f7b8 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove always false condition check

We do not need this check as this is same thing as
NTFS_MIN_MFT_ZONE > zlen. We already check NTFS_MIN_MFT_ZONE <= zlen and
exit because is too big request. Remove it so code is cleaner.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

edb853ff 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix ntfs_look_for_free_space() does only report -ENOSPC

If ntfs_refresh_zone() returns error it will be changed to -ENOSPC. It
is not right. Also caller of this functions also check other errors.

Fixes: 78ab59fee07f ("fs/ntfs3: Rework file operations")
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

cffb5152 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove tabs before spaces from comment

Remove tabs before spaces from comment as recommended by kernel coding
style. Checkpatch also warn about these.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

2829e39e 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove braces from single statment block

Remove braces from single statment block as they are not needed. Also
Linux kernel coding style guide recommend this and checkpatch warn about
this.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

4ca7fe57 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Place Comparisons constant right side of the test

For better code readability place constant always right side of the
test. This will also address checkpatch warning.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

7d95995a 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove '+' before constant in ni_insert_resident()

No need for plus sign here. So remove it. Checkpatch will also be happy.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b7213ffa 15-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

qnx4: avoid stringop-overread errors

The qnx4 directory entries are 64-byte blocks that have different
contents depending on the a status byte that is in the last byte of the
block.

In particular, a directory entry can be either a "link info" entry with
a 48-byte name and pointers to the real inode information, or an "inode
entry" with a smaller 16-byte name and the full inode information.

But the code was written to always just treat the directory name as if
it was part of that "inode entry", and just extend the name to the
longer case if the status byte said it was a link entry.

That work just fine and gives the right results, but now that gcc is
tracking data structure accesses much more, the code can trigger a
compiler error about using up to 48 bytes (the long name) in a structure
that only has that shorter name in it:

fs/qnx4/dir.c: In function ‘qnx4_readdir’:
fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
51 | size = strnlen(de->di_fname, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs/qnx4/qnx4.h:3,
from fs/qnx4/dir.c:16:
include/uapi/linux/qnx4_fs.h:45:25: note: source object declared here
45 | char di_fname[QNX4_SHORT_NAME_MAX];
| ^~~~~~~~

which is because the source code doesn't really make this whole "one of
two different types" explicit.

Fix this by introducing a very explicit union of the two types, and
basically explaining to the compiler what is really going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b66ceaf3 15-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move iopoll reissue into regular IO path

230d50d448acb ("io_uring: move reissue into regular IO path")
made non-IOPOLL I/O to not retry from ki_complete handler. Follow it
steps and do the same for IOPOLL. Same problems, same implementation,
same -EAGAIN assumptions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f80dfee2d5fa7678f0052a8ab3cfca9496a112ca.1631699928.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd658695 10-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: use iov_iter state save/restore helpers

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d329e12 14-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: allow retry for O_NONBLOCK if async is supported

A common complaint is that using O_NONBLOCK files with io_uring can be a
bit of a pain. Be a bit nicer and allow normal retry IFF the file does
support async behavior. This makes it possible to use io_uring more
reliably with O_NONBLOCK files, for use cases where it either isn't
possible or feasible to modify the file flags.

Cc: stable@vger.kernel.org
Reported-and-tested-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9c7b0ba8 14-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: auto-removal for direct open/accept

It might be inconvenient that direct open/accept deviates from the
update semantics and fails if the slot is taken instead of removing a
file sitting there. Implement this auto-removal.

Note that removal might need to allocate and so may fail. However, if an
empty slot is specified, it's guaraneed to not fail on the fd
installation side for valid userspace programs. It's needed for users
who can't tolerate such failures, e.g. accept where the other end
never retries.

Suggested-by: Franz-B. Tuneke <franz-bernhard.tuneke@tu-dortmund.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c896f14ea46b0eaa6c09d93149e665c2c37979b4.1631632300.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

44df58d4 14-Sep-2021 Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix missing sigmask restore in io_cqring_wait()

Move get_timespec() section in io_cqring_wait() before the sigmask
saving, otherwise we'll fail to restore sigmask once get_timespec()
returns error.

Fixes: c73ebb685fb6 ("io_uring: add timeout support for io_uring_enter()")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210914143852.9663-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

41d3a6bd 13-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: pin SQPOLL data before unlocking ring lock

We need to re-check sqd->thread after we've dropped the lock. Pin
the sqd before doing the lockdep lock dance, and check if the thread
is alive after that. It's either NULL or alive, as the SQPOLL thread
cannot exit without holding the same sqd->lock.

Reported-and-tested-by: syzbot+337de45f13a4fd54d708@syzkaller.appspotmail.com
Fixes: fa84693b3c89 ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4c51de1e 13-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: fix incorrect kernel doc comments

Correct kernel-doc comments pointed out by the
automated kernel test robot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

099dd788 13-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: remove pathname for file from SPDX header

checkpatch complains about source files with filenames (e.g. in
these cases just below the SPDX header in comments at the top of
various files in fs/cifs). It also is helpful to change this now
so will be less confusing when the parent directory is renamed
e.g. from fs/cifs to fs/smb_client (or fs/smbfs)

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

8e692122 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Always use binary search with entry search

We do not have any reason to keep old linear search in. Before this was
used for error path or if table was so big that it cannot be allocated.
Current binary search implementation won't need error path. Remove old
references to linear entry search.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

ef929700 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Make binary search to search smaller chunks in beginning

We could try to optimize algorithm to first fill just small table and
after that use bigger table all the way up to ARRAY_SIZE(offs). This
way we can use bigger search array, but not lose benefits with entry
count smaller < ARRAY_SIZE(offs).

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

162333ef 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Limit binary search table size

Current binary search allocates memory for table and fill whole table
before we start actual binary search. This is quite inefficient because
table fill will always be O(n). Also if table is huge we need to
reallocate memory which is costly.

This implementation use just stack memory and always when table is full
we will check if last element is <= and if not start table fill again.
The idea was that it would be same cost as table reallocation.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9c2aadd0 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unneeded header files from c files

We have lot of unnecessary headers in these files. Remove them so that
we help compiler a little bit.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

977d0558 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change right headers to lznt.c

There is lot of headers which we do not need in this file. Delete them
and add what we really need. Here is list which identify why we need
this header.

<linux/kernel.h> // min()
<linux/slab.h> // kzalloc()
<linux/stddef.h> // offsetof()
<linux/string.h> // memcpy(), memset()
<linux/types.h> // u8, size_t, etc.

"debug.h" // PtrOffset()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

f9767661 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change right headers to upcase.c

There is no headers. They will be included through ntfs_fs.c, but that
is not right thing to do. Let's include headers what this file need
straight away.

types.h is needed for __le16, u8 etc.
kernel.h is needed for le16_to_cpu()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

c632f639 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change right headers to bitfunc.c

We only need linux/types.h for types like u8 etc. So we can remove rest
and help compiler a little bit.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b6ba8103 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add missing header and guards to lib/ headers

size_t needs header. Add missing header guards so that compiler will
only include these ones.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

f239b3a9 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add missing headers and forward declarations to ntfs_fs.h

We do not have headers at all in this file. We should have them so that
not every .c file needs to include all of the stuff which this file need
for building. This way we can remove some headers from other files and
get better picture what is needed. This can save some compilation time.
And this can help if we sometimes want to separate this one big header.

Also use forward declarations for structs and enums when it not included
straight with include and it is used in function declarations input.
This will prevent possible compiler warning:
xxx declared inside parameter list will not be visible
outside of this definition or declaration

Here is list which I made when parsing this. There is not necessarily
all example from this header file, but this just proofs we need it.

<linux/blkdev.h> SECTOR_SHIFT
<linux/buffer_head.h> sb_bread(), put_bh
<linux/cleancache.h> put_page()
<linux/fs.h> struct inode (Just struct ntfs_inode need it)
<linux/highmem.h> kunmap(), kmap()
<linux/kernel.h> cpu_to_leXX() ALIGN
<linux/mm.h> kvfree()
<linux/mutex.h> struct mutex, mutex_(un/try)lock()
<linux/page-flags.h> PageError()
<linux/pagemap.h> read_mapping_page()
<linux/rbtree.h> struct rb_root
<linux/rwsem.h> struct rw_semaphore
<linux/slab.h> krfree(), kzalloc()
<linux/string.h> memset()
<linux/time64.h> struct timespec64
<linux/types.h> uXX, __leXX
<linux/uidgid.h> kuid_t, kgid_t
<asm/div64.h> do_div()
<asm/page.h> PAGE_SIZE

"debug.h" ntfs_err() (Just one entry. Maybe we can drop this)
"ntfs.h" Do you even ask?

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

4dfe8332 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add missing header files to ntfs.h

We do not have header files at all in this file. Add following headers
and there is also explanation which for it was added. Note that
explanation might not be complete, but it just proofs it is needed.

<linux/blkdev.h> // SECTOR_SHIFT
<linux/build_bug.h> // static_assert()
<linux/kernel.h> // cpu_to_le64, cpu_to_le32, ALIGN
<linux/stddef.h> // offsetof()
<linux/string.h> // memcmp()
<linux/types.h> //__le32, __le16

"debug.h" // PtrOffset(), Add2Ptr()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

cde81f13 02-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3. Add forward declarations for structs to debug.h

Add forward declarations for structs so that we can include this file
without warnings even without linux/fs.h

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

0327c6d0 03-Sep-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Remove redundant initialization of variable err

The variable err is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

dd47c104 13-Sep-2021 Eugene Syromiatnikov <esyr@redhat.com>

io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items

The items passed in the array pointed by the arg parameter
of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation
carry certain semantics: they refer to different io-wq worker categories;
provide IO_WQ_* constants in the UAPI, so these categories can be referenced
in the user space code.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Complements: 2e480058ddc21ec5 ("io-wq: provide a way to limit max number of workers")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/20210913154415.GA12890@asgard.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9d37e1ca 08-Sep-2021 David Howells <dhowells@redhat.com>

afs: Fix updating of i_blocks on file/dir extension

When an afs file or directory is modified locally such that the total file
size is extended, i_blocks needs to be recalculated too.

Fix this by making afs_write_end() and afs_edit_dir_add() call
afs_set_i_size() rather than setting inode->i_size directly as that also
recalculates inode->i_blocks.

This can be tested by creating and writing into directories and files and
then examining them with du. Without this change, directories show a 4
blocks (they start out at 2048 bytes) and files show 0 blocks; with this
change, they should show a number of blocks proportional to the file size
rounded up to 1024.

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/

b537a3c2 09-Sep-2021 David Howells <dhowells@redhat.com>

afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server

AFS-3 has two data fetch RPC variants, FS.FetchData and FS.FetchData64, and
Linux's afs client switches between them when talking to a non-YFS server
if the read size, the file position or the sum of the two have the upper 32
bits set of the 64-bit value.

This is a problem, however, since the file position and length fields of
FS.FetchData are *signed* 32-bit values.

Fix this by capturing the capability bits obtained from the fileserver when
it's sent an FS.GetCapabilities RPC, rather than just discarding them, and
then picking out the VICED_CAPABILITY_64BITFILES flag. This can then be
used to decide whether to use FS.FetchData or FS.FetchData64 - and also
FS.StoreData or FS.StoreData64 - rather than using upper_32_bits() to
switch on the parameter values.

This capabilities flag could also be used to limit the maximum size of the
file, but all servers must be checked for that.

Note that the issue does not exist with FS.StoreData - that uses *unsigned*
32-bit values. It's also not a problem with Auristor servers as its
YFS.FetchData64 op uses unsigned 64-bit values.

This can be tested by cloning a git repo through an OpenAFS client to an
OpenAFS server and then doing "git status" on it from a Linux afs
client[1]. Provided the clone has a pack file that's in the 2G-4G range,
the git status will show errors like:

error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index

This can be observed in the server's FileLog with something like the
following appearing:

Sun Aug 29 19:31:39 2021 SRXAFS_FetchData, Fid = 2303380852.491776.3263114, Host 192.168.11.201:7001, Id 1001
Sun Aug 29 19:31:39 2021 CheckRights: len=0, for host=192.168.11.201:7001
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: Pos 18446744071815340032, Len 3154
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: file size 2400758866
...
Sun Aug 29 19:31:40 2021 SRXAFS_FetchData returns 5

Note the file position of 18446744071815340032. This is the requested file
position sign-extended.

Fixes: b9b1f8d5930a ("AFS: write support fixes")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
cc: openafs-devel@openafs.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c9 [1]
Link: https://lore.kernel.org/r/951332.1631308745@warthog.procyon.org.uk/

4fe6a946 02-Sep-2021 David Howells <dhowells@redhat.com>

afs: Try to avoid taking RCU read lock when checking vnode validity

Try to avoid taking the RCU read lock when checking the validity of a
vnode's callback state. The only thing it's needed for is to pin the
parent volume's server list whilst we search it to find the record of the
server we're currently using to see if it has been reinitialised (ie. it
sent us a CB.InitCallBackState* RPC).

Do this by the following means:

(1) Keep an additional per-cell counter (fs_s_break) that's incremented
each time any of the fileservers in the cell reinitialises.

Since the new counter can be accessed without RCU from the vnode, we
can check that first - and only if it differs, get the RCU read lock
and check the volume's server list.

(2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
indicates whether the callback promise is still expected to be present
on the server. This does the checks as described in (1).

(3) Restructure afs_check_validity() to take account of the change in (2).

We can also get rid of the valid variable and just use the need_clear
variable with the addition of the afs_cb_break_no_promise reason.

(4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.

Move the change to vnode->cb_v_break to __afs_break_callback().

Delegate the change to vnode->cb_s_break to afs_select_fileserver()
and set vnode->cb_fs_s_break there also.

(5) afs_validate() no longer needs to get the RCU read lock around its
call to afs_check_validity() - and can skip the call entirely if we
don't have a promise.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/

6e0e99d5 02-Sep-2021 David Howells <dhowells@redhat.com>

afs: Fix mmap coherency vs 3rd-party changes

Fix the coherency management of mmap'd data such that 3rd-party changes
become visible as soon as possible after the callback notification is
delivered by the fileserver. This is done by the following means:

(1) When we break a callback on a vnode specified by the CB.CallBack call
from the server, we queue a work item (vnode->cb_work) to go and
clobber all the PTEs mapping to that inode.

This causes the CPU to trip through the ->map_pages() and
->page_mkwrite() handlers if userspace attempts to access the page(s)
again.

(Ideally, this would be done in the service handler for CB.CallBack,
but the server is waiting for our reply before considering, and we
have a list of vnodes, all of which need breaking - and the process of
getting the mmap_lock and stripping the PTEs on all CPUs could be
quite slow.)

(2) Call afs_validate() from the ->map_pages() handler to check to see if
the file has changed and to get a new callback promise from the
server.

Also handle the fileserver telling us that it's dropping all callbacks,
possibly after it's been restarted by sending us a CB.InitCallBackState*
call by the following means:

(3) Maintain a per-cell list of afs files that are currently mmap'd
(cell->fs_open_mmaps).

(4) Add a work item to each server that is invoked if there are any open
mmaps when CB.InitCallBackState happens. This work item goes through
the aforementioned list and invokes the vnode->cb_work work item for
each one that is currently using this server.

This causes the PTEs to be cleared, causing ->map_pages() or
->page_mkwrite() to be called again, thereby calling afs_validate()
again.

I've chosen to simply strip the PTEs at the point of notification reception
rather than invalidate all the pages as well because (a) it's faster, (b)
we may get a notification for other reasons than the data being altered (in
which case we don't want to clobber the pagecache) and (c) we need to ask
the server to find out - and I don't want to wait for the reply before
holding up userspace.

This was tested using the attached test program:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
int main(int argc, char *argv[])
{
size_t size = getpagesize();
unsigned char *p;
bool mod = (argc == 3);
int fd;
if (argc != 2 && argc != 3) {
fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
exit(2);
}
fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
if (fd < 0) {
perror(argv[1]);
exit(1);
}

p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (;;) {
if (mod) {
p[0]++;
msync(p, size, MS_ASYNC);
fsync(fd);
}
printf("%02x", p[0]);
fflush(stdout);
sleep(1);
}
}

It runs in two modes: in one mode, it mmaps a file, then sits in a loop
reading the first byte, printing it and sleeping for a second; in the
second mode it mmaps a file, then sits in a loop incrementing the first
byte and flushing, then printing and sleeping.

Two instances of this program can be run on different machines, one doing
the reading and one doing the writing. The reader should see the changes
made by the writer, but without this patch, they aren't because validity
checking is being done lazily - only on entry to the filesystem.

Testing the InitCallBackState change is more complicated. The server has
to be taken offline, the saved callback state file removed and then the
server restarted whilst the reading-mode program continues to run. The
client machine then has to poke the server to trigger the InitCallBackState
call.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/

63d49d84 01-Sep-2021 David Howells <dhowells@redhat.com>

afs: Fix incorrect triggering of sillyrename on 3rd-party invalidation

The AFS filesystem is currently triggering the silly-rename cleanup from
afs_d_revalidate() when it sees that a dentry has been changed by a third
party[1]. It should not be doing this as the cleanup includes deleting the
silly-rename target file on iput.

Fix this by removing the places in the d_revalidate handling that validate
anything other than the directory and the dirent. It probably should not
be looking to validate the target inode of the dentry also.

This includes removing the point in afs_d_revalidate() where the inode that
a dentry used to point to was marked as being deleted (AFS_VNODE_DELETED).
We don't know it got deleted. It could have been renamed or it could have
hard links remaining.

This was reproduced by cloning a git repo onto an afs volume on one
machine, switching to another machine and doing "git status", then
switching back to the first and doing "git status". The second status
would show weird output due to ".git/index" getting deleted by the above
mentioned mechanism.

A simpler way to do it is to do:

machine 1: touch a
machine 2: touch b; mv -f b a
machine 1: stat a

on an afs volume. The bug shows up as the stat failing with ENOENT and the
file server log showing that machine 1 deleted "a".

Fixes: 79ddbfa500b3 ("afs: Implement sillyrename for unlink and rename")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c4 [1]
Link: https://lore.kernel.org/r/163111668100.283156.3851669884664475428.stgit@warthog.procyon.org.uk/

3978d816 01-Sep-2021 David Howells <dhowells@redhat.com>

afs: Add missing vnode validation checks

afs_d_revalidate() should only be validating the directory entry it is
given and the directory to which that belongs; it shouldn't be validating
the inode/vnode to which that dentry points. Besides, validation need to
be done even if we don't call afs_d_revalidate() - which might be the case
if we're starting from a file descriptor.

In order for afs_d_revalidate() to be fixed, validation points must be
added in some other places. Certain directory operations, such as
afs_unlink(), already check this, but not all and not all file operations
either.

Note that the validation of a vnode not only checks to see if the
attributes we have are correct, but also gets a promise from the server to
notify us if that file gets changed by a third party.

Add the following checks:

- Check the vnode we're going to make a hard link to.
- Check the vnode we're going to move/rename.
- Check the vnode we're going to read from.
- Check the vnode we're going to write to.
- Check the vnode we're going to sync.
- Check the vnode we're going to make a mapped page writable for.

Some of these aren't strictly necessary as we're going to perform a server
operation that might get the attributes anyway from which we can determine
if something changed - though it might not get us a callback promise.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111667354.283156.12720698333342917516.stgit@warthog.procyon.org.uk/

767a65e9 11-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io-wq: fix potential race of acct->nr_workers

Given max_worker is 1, and we currently have 1 running and it is
exiting. There may be race like:
io_wqe_enqueue worker1
no work there and timeout
unlock(wqe->lock)
->insert work
-->io_worker_exit
lock(wqe->lock)
->if(!nr_workers) //it's still 1
unlock(wqe->lock)
goto run_cancel
lock(wqe->lock)
nr_workers--
->dec_running
->worker creation fails
unlock(wqe->lock)

We enqueued one work but there is no workers, causes hung.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7a842fb5 11-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io-wq: code clean of io_wqe_create_worker()

Remove do_create to save a local variable.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

16c8d2df 12-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: ensure symmetry in handling iter types in loop_rw_iter()

When setting up the next segment, we check what type the iter is and
handle it accordingly. However, when incrementing and processed amount
we do not, and both iter advance and addr/len are adjusted, regardless
of type. Split the increment side just like we do on the setup side.

Fixes: 4017eb91a9e7 ("io_uring: make loop_rw_iter() use original user supplied pointers")
Cc: stable@vger.kernel.org
Reported-by: Valentina Palmiotti <vpalmiotti@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fdfc3463 12-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull namei updates from Al Viro:
"Clearing fallout from mkdirat in io_uring series. The fix in the
kern_path_locked() patch plus associated cleanups"

* 'misc.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
putname(): IS_ERR_OR_NULL() is wrong here
namei: Standardize callers of filename_create()
namei: Standardize callers of filename_lookup()
rename __filename_parentat() to filename_parentat()
namei: Fix use after free in kern_path_locked


8d4a0b5d 12-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smbfs updates from Steve French:
"cifs/smb3 updates:

- DFS reconnect fix

- begin creating common headers for server and client

- rename the cifs_common directory to smbfs_common to be more
consistent ie change use of the name cifs to smb (smb3 or smbfs is
more accurate, as the very old cifs dialect has long been
superseded by smb3 dialects).

In the future we can rename the fs/cifs directory to fs/smbfs.

This does not include the set of multichannel fixes nor the two
deferred close fixes (they are still being reviewed and tested)"

* tag '5.15-rc-cifs-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: properly invalidate cached root handle when closing it
cifs: move SMB FSCTL definitions to common code
cifs: rename cifs_common to smbfs_common
cifs: update FSCTL definitions


78e70952 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- vduse driver ("vDPA Device in Userspace") supporting emulated virtio
block devices

- virtio-vsock support for end of record with SEQPACKET

- vdpa: mac and mq support for ifcvf and mlx5

- vdpa: management netlink for ifcvf

- virtio-i2c, gpio dt bindings

- misc fixes and cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
Documentation: Add documentation for VDUSE
vduse: Introduce VDUSE - vDPA Device in Userspace
vduse: Implement an MMU-based software IOTLB
vdpa: Support transferring virtual addressing during DMA mapping
vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
vhost-iotlb: Add an opaque pointer for vhost IOTLB
vhost-vdpa: Handle the failure of vdpa_reset()
vdpa: Add reset callback in vdpa_config_ops
vdpa: Fix some coding style issues
file: Export receive_fd() to modules
eventfd: Export eventfd_wake_count to modules
iova: Export alloc_iova_fast() and free_iova_fast()
virtio-blk: remove unneeded "likely" statements
virtio-balloon: Use virtio_find_vqs() helper
vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
vsock_test: update message bounds test for MSG_EOR
af_vsock: rename variables in receive loop
virtio/vsock: support MSG_EOR bit processing
vhost/vsock: support MSG_EOR bit processing
...


c605c396 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right
now as we have plenty of flags left, but could become one. (Hao)

- Fix lockdep issue introduced in this merge window (me)

- Fix a few issues with the worker creation (me, Pavel, Qiang)

- Fix regression with wq_has_sleeper() for IOPOLL (Pavel)

- Timeout link error propagation fix (Pavel)

* tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
io_uring: fail links of cancelled timeouts
io-wq: fix memory leak in create_io_worker()
io-wq: fix silly logic error in io_task_work_match()
io_uring: drop ctx->uring_lock before acquiring sqd->lock
io_uring: fix missing mb() before waitqueue_active
io-wq: fix cancellation on create-worker failure


c0f7e49f 11-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- NVMe pull request from Christoph:
- fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
- update a MAINTAINERS email address (Chaitanya Kulkarni)
- set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
- handle errors from add_disk() (Luis Chamberlain)
- update the keep alive interval when kato is modified (Tatsuya Sasaki)
- fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
- do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
- only call synchronize_srcu when clearing current path (Daniel Wagner)
- revalidate paths during rescan (Hannes Reinecke)

- Split out the fs/block_dev into block/fops.c and block/bdev.c, which
has been long overdue. Do this now before -rc1, to avoid annoying
conflicts due to this (Christoph)

- blk-throtl use-after-free fix (Li)

- Improve plug depth for multi-device plugs, greatly increasing md
resync performance (Song)

- blkdev_show() locking fix (Tetsuo)

- n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
n64cart: fix return value check in n64cart_probe()
blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
block: move fs/block_dev.c to block/bdev.c
block: split out operations on block special files
blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
block: genhd: don't call blkdev_show() with major_names_lock held
nvme: update MAINTAINERS email address
nvme: add error handling support for add_disk()
nvme: only call synchronize_srcu when clearing current path
nvme: update keep alive interval when kato is modified
nvme-tcp: Do not reset transport on data digest errors
nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
nvmet: looks at the passthrough controller when initializing CAP
nvme: move nvme_multi_css into nvme.h
nvme-multipath: revalidate paths during rescan
nvme-multipath: set QUEUE_FLAG_NOWAIT


581b2027 01-Sep-2021 David Howells <dhowells@redhat.com>

afs: Fix page leak

There's a loop in afs_extend_writeback() that adds extra pages to a write
we want to make to improve the efficiency of the writeback by making it
larger. This loop stops, however, if we hit a page we can't write back
from immediately, but it doesn't get rid of the page ref we speculatively
acquired.

This was caused by the removal of the cleanup loop when the code switched
from using find_get_pages_contig() to xarray scanning as the latter only
gets a single page at a time, not a batch.

Fix this by putting the page on a ref on an early break from the loop.
Unfortunately, we can't just add that page to the pagevec we're employing
as we'll go through that and add those pages to the RPC call.

This was found by the generic/074 test. It leaks ~4GiB of RAM each time it
is run - which can be observed with "top".

Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111666635.283156.177701903478910460.stgit@warthog.procyon.org.uk/

345e1ae0 27-Aug-2021 David Howells <dhowells@redhat.com>

afs: Fix missing put on afs_read objects and missing get on the key therein

The afs_read objects created by afs_req_issue_op() get leaked because
afs_alloc_read() returns a ref and then afs_fetch_data() gets its own ref
which is released when the operation completes, but the initial ref is
never released.

Fix this by discarding the initial ref at the end of afs_req_issue_op().

This leak also covered another bug whereby a ref isn't got on the key
attached to the read record by afs_req_issue_op(). This isn't a problem as
long as the afs_read req never goes away...

Fix this by calling key_get() in afs_req_issue_op().

This was found by the generic/074 test. It leaks a bunch of kmalloc-192
objects each time it is run, which can be observed by watching
/proc/slabinfo.

Fixes: f7605fa869cf ("afs: Fix leak of afs_read objects")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163010394740.3035676.8516846193899793357.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163111665914.283156.3038561975681836591.stgit@warthog.procyon.org.uk/

4396a731 09-Sep-2021 Amir Goldstein <amir73il@gmail.com>

fsnotify: fix sb_connectors leak

Fix a leak in s_fsnotify_connectors counter in case of a race between
concurrent add of new fsnotify mark to an object.

The task that lost the race fails to drop the counter before freeing
the unused connector.

Following umount() hangs in fsnotify_sb_delete()/wait_var_event(),
because s_fsnotify_connectors never drops to zero.

Fixes: ec44610fe2b8 ("fsnotify: count all objects with attached connectors")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Link: https://lore.kernel.org/linux-fsdevel/20210907063338.ycaw6wvhzrfsfdlp@xzhoux.usersys.redhat.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32c2d33e 06-Sep-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT

Build check of __REQ_F_LAST_BIT should be larger than, not equal or larger
than. It's perfectly valid to have __REQ_F_LAST_BIT be 32, as that means
that the last valid bit is 31 which does fit in the type.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210907032243.114190-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bf9f243f 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-ksmbd-part2' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:

- various fixes pointed out by coverity, and a minor cleanup patch

- id mapping and ownership fixes

- an smbdirect fix

* tag '5.15-rc-ksmbd-part2' of git://git.samba.org/ksmbd:
ksmbd: fix control flow issues in sid_to_id()
ksmbd: fix read of uninitialized variable ret in set_file_basic_info
ksmbd: add missing assignments to ret on ndr_read_int64 read calls
ksmbd: add validation for ndr read/write functions
ksmbd: remove unused ksmbd_file_table_flush function
ksmbd: smbd: fix dma mapping error in smb_direct_post_send_data
ksmbd: Reduce error log 'speed is unknown' to debug
ksmbd: defer notify_change() call
ksmbd: remove setattr preparations in set_file_basic_info()
ksmbd: ensure error is surfaced in set_file_basic_info()
ndr: fix translation in ndr_encode_posix_acl()
ksmbd: fix translation in sid_to_id()
ksmbd: fix subauth 0 handling in sid_to_id()
ksmbd: fix translation in acl entries
ksmbd: fix translation in ksmbd_acls_fattr()
ksmbd: fix translation in create_posix_rsp_buf()
ksmbd: fix translation in smb2_populate_readdir_entry()
ksmbd: fix lookup on idmapped mounts


8dde2086 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix max_inline mount option limit on 64k page system

- lockdep fixes:
- update bdev time in a safer way
- move bdev put outside of sb write section when removing device
- fix possible deadlock when mounting seed/sprout filesystem

- zoned mode: fix split extent accounting

- minor include fixup

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix double counting of split ordered extent
btrfs: fix lockdep warning while mounting sprout fs
btrfs: delay blkdev_put until after the device remove
btrfs: update the bdev time directly when closing
btrfs: use correct header for div_u64 in misc.h
btrfs: fix upper limit for max_inline for page size 64K


9351590f 09-Sep-2021 Enzo Matsumiya <ematsumiya@suse.de>

cifs: properly invalidate cached root handle when closing it

Cached root file was not being completely invalidated sometimes.

Reproducing:
- With a DFS share with 2 targets, one disabled and one enabled
- start some I/O on the mount
# while true; do ls /mnt/dfs; done
- at the same time, disable the enabled target and enable the disabled
one
- wait for DFS cache to expire
- on reconnect, the previous cached root handle should be invalid, but
open_cached_dir_by_dentry() will still try to use it, but throws a
use-after-free warning (kref_get())

Make smb2_close_cached_fid() invalidate all fields every time, but only
send an SMB2_close() when the entry is still valid.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>

d6c338a7 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

- Support for VMAP_STACK

- Support for splice_write in hostfs

- Fixes for virt-pci

- Fixes for virtio_uml

- Various fixes

* tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: fix stub location calculation
um: virt-pci: fix uapi documentation
um: enable VMAP_STACK
um: virt-pci: don't do DMA from stack
hostfs: support splice_write
um: virtio_uml: fix memory leak on init failures
um: virtio_uml: include linux/virtio-uml.h
lib/logic_iomem: fix sparse warnings
um: make PCI emulation driver init/exit static


35776f10 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM development updates from Russell King:

- Rename "mod_init" and "mod_exit" so that initcall debug output is
actually useful (Randy Dunlap)

- Update maintainers entries for linux-arm-kernel to indicate it is
moderated for non-subscribers (Randy Dunlap)

- Move install rules to arch/arm/Makefile (Masahiro Yamada)

- Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

- Don't warn about atags_to_fdt() stack size (David Heidelberg)

- Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

- Get rid of set_fs() usage (Arnd Bergmann)

- Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
ARM: 9116/1: unified: Remove check for gcc < 4
ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
ARM: 9113/1: uaccess: remove set_fs() implementation
ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
ARM: 9111/1: oabi-compat: rework fcntl64() emulation
ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
ARM: 9107/1: syscall: always store thread_info->abi_syscall
ARM: 9109/1: oabi-compat: add epoll_pwait handler
ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
ARM: 9105/1: atags_to_fdt: don't warn about stack size
ARM: 9103/1: Drop ARCH_NR_GPIOS definition
ARM: 9102/1: move theinstall rules to arch/arm/Makefile
ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific


f154c806 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:
"Except for the xpram device driver removal it is all about fixes and
cleanups.

- Fix topology update on cpu hotplug, so notifiers see expected
masks. This bug was uncovered with SCHED_CORE support.

- Fix stack unwinding so that the correct number of entries are
omitted like expected by common code. This fixes KCSAN selftests.

- Add kmemleak annotation to stack_alloc to avoid false positive
kmemleak warnings.

- Avoid layering violation in common I/O code and don't unregister
subchannel from child-drivers.

- Remove xpram device driver for which no real use case exists since
the kernel is 64 bit only. Also all hypervisors got required
support removed in the meantime, which means the xpram device
driver is dead code.

- Fix -ENODEV handling of clp_get_state in our PCI code.

- Enable KFENCE in debug defconfig.

- Cleanup hugetlbfs s390 specific Kconfig dependency.

- Quite a lot of trivial fixes to get rid of "W=1" warnings, and and
other simple cleanups"

* tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
hugetlbfs: s390 is always 64bit
s390/ftrace: remove incorrect __va usage
s390/zcrypt: remove incorrect kernel doc indicators
scsi: zfcp: fix kernel doc comments
s390/sclp: add __nonstring annotation
s390/hmcdrv_ftp: fix kernel doc comment
s390: remove xpram device driver
s390/pci: read clp_list_pci_req only once
s390/pci: fix clp_get_state() handling of -ENODEV
s390/cio: fix kernel doc comment
s390/ctrlchar: fix kernel doc comment
s390/con3270: use proper type for tasklet function
s390/cpum_cf: move array from header to C file
s390/mm: fix kernel doc comments
s390/topology: fix topology information when calling cpu hotplug notifiers
s390/unwind: use current_frame_address() to unwind current task
s390/configs: enable CONFIG_KFENCE in debug_defconfig
s390/entry: make oklabel within CHKSTG macro local
s390: add kmemleak annotation in stack_alloc()
s390/cio: dont unregister subchannel from child-drivers


7b871c77 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull gfs2 setattr updates from Al Viro:
"Make it possible for filesystems to use a generic 'may_setattr()' and
switch gfs2 to using it"

* 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gfs2: Switch to may_setattr in gfs2_setattr
fs: Move notify_change permission checks into may_setattr


e2e694b9 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull root filesystem type handling updates from Al Viro:
"Teach init/do_mounts.c to handle non-block filesystems, hopefully
preventing even more special-cased kludges (such as root=/dev/nfs,
etc)"

* 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: simplify get_filesystem_list / get_all_fs_names
init: allow mounting arbitrary non-blockdevice filesystems as root
init: split get_fs_names


7b7699c0 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter fixes from Al Viro:
"Fixes for io-uring handling of iov_iter reexpands"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
io_uring: reexpand under-reexpanded iters
iov_iter: track truncated size


2e5fd489 09-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

- Fix a race condition in the teardown path of raw mode pmem
namespaces.

- Cleanup the code that filesystems use to detect filesystem-dax
capabilities of their underlying block device.

* tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: remove bdev_dax_supported
xfs: factor out a xfs_buftarg_is_dax helper
dax: stub out dax_supported for !CONFIG_FS_DAX
dax: remove __generic_fsdax_supported
dax: move the dax_read_lock() locking into dax_supported
dax: mark dax_get_by_host static
dm: use fs_dax_get_by_bdev instead of dax_get_by_host
dax: stop using bdevname
fsdax: improve the FS_DAX Kconfig description and help text
libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind


15b2ae77 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Show uid/gid always in show_options()

Show options should show option according documentation when some value
is not default or when ever coder wants. Uid/gid are problematic because
it is hard to know which are defaults. In file system there is many
different implementation for this problem.

Some file systems show uid/gid when they are different than root, some
when user has set them and some show them always. There is also problem
that what if root uid/gid change. This code just choose to show them
always. This way we do not need to think this any more.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

28a941ff 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Rename mount option no_acs_rules > (no)acsrules

Rename mount option no_acs_rules to (no)acsrules. This allow us to use
possibility to mount with options noaclrules or aclrules.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

e274cde8 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add iocharset= mount option as alias for nls=

Other fs drivers are using iocharset= mount option for specifying charset.
So add it also for ntfs3 and mark old nls= mount option as deprecated.

Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9d1939f4 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Make mount option nohidden more universal

If we call Opt_nohidden with just keyword hidden, then we can use
hidden/nohidden when mounting. We already use this method for almoust
all other parameters so it is just logical that this will use same
method.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

27fac777 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Init spi more in init_fs_context than fill_super

init_fs_context() is meant to initialize s_fs_info (spi). Move spi
initializing code there which we can initialize before fill_super().

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

610f8f5a 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use new api for mounting

We have now new mount api as described in Documentation/filesystems. We
should use it as it gives us some benefits which are desribed here
lore.kernel.org/linux-fsdevel/159646178122.1784947.11705396571718464082.stgit@warthog.procyon.org.uk/

Nls loading is changed a to load with string. This did make code also
little cleaner.

Also try to use fsparam_flag_no as much as possible. This is just nice
little touch and is not mandatory but it should not make any harm. It
is just convenient that we can use example acl/noacl mount options.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

564c97bd 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Convert mount options to pointer in sbi

Use pointer to mount options. We want to do this because we will use new
mount api which will benefit that we have spi and mount options in
different allocations. When we remount we do not have to make whole new
spi it is enough that we will allocate just mount options.

Please note that we can do example remount lot cleaner but things will
change in next patch so this should be just functional.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

c2c389fd 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unnecesarry remount flag handling

Remove unnecesarry remount flag handling. This does not do anything for
this driver. We have already set SB_NODIRATIME when we fill super. Also
noatime should be set from mount option. Now for some reson we try to
set it when remounting.

Lazytime part looks like it is copied from f2fs and there is own mount
parameter for it. That is why they use it. We do not set lazytime
anywhere in our code. So basically this just blocks lazytime when
remounting.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

b8a30b41 07-Sep-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unnecesarry mount option noatime

Remove unnecesarry mount option noatime because this will be handled
by VFS. Our option parser will never get opt like this.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

2ae2eb9d 09-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fail links of cancelled timeouts

When we cancel a timeout we should mark it with REQ_F_FAIL, so
linked requests are cancelled as well, but not queued for further
execution.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

948ca5f3 19-Aug-2021 Eric Whitney <enwlinux@gmail.com>

ext4: enforce buffer head state assertion in ext4_da_map_blocks

Remove the code that re-initializes a buffer head with an invalid block
number and BH_New and BH_Delay bits when a matching delayed and
unwritten block has been found in the extent status cache. Replace it
with assertions that verify the buffer head already has this state
correctly set. The current code masked an inline data truncation bug
that left stale entries in the extent status cache. With this change,
generic/130 can be used to reproduce and detect that bug.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210819144927.25163-3-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

0add491d 19-Aug-2021 Eric Whitney <enwlinux@gmail.com>

ext4: remove extent cache entries when truncating inline data

Conditionally remove all cached extents belonging to an inode
when truncating its inline data. It's only necessary to attempt to
remove cached extents when a conversion from inline to extent storage
has been initiated (!EXT4_STATE_MAY_INLINE_DATA). This avoids
unnecessary es lock overhead in the more common inline case.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210819144927.25163-2-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

11ef08c9 04-Sep-2021 Theodore Ts'o <tytso@mit.edu>

Merge branch 'delalloc-buffer-write' into dev

Fix a bug in how we update i_disksize, and the error path in
inline_data_end. Finally, drop an unnecessary creation of a journal
handle which was only needed for inline data, which can give us a
large performance gain in delayed allocation writes.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>


66e70be7 09-Sep-2021 Qiang.zhang <qiang.zhang@windriver.com>

io-wq: fix memory leak in create_io_worker()

BUG: memory leak
unreferenced object 0xffff888126fcd6c0 (size 192):
comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s)
backtrace:
[<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline]
[<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline]
[<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739
[<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline]
[<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866
[<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473
[<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933
[<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233
[<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462
[<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158
[<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164
[<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline]
[<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline]
[<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
[<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209
[<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
[<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302
[<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
[<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

when create_io_thread() return error, and not retry, the worker object
need to be freed.

Reported-by: syzbot+65454c239241d3d647da@syzkaller.appspotmail.com
Signed-off-by: Qiang.zhang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210909115822.181188-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d014f5f 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: move SMB FSCTL definitions to common code

The FSCTL definitions are in smbfsctl.h which should be
shared by client and server. Move the updated version of
smbfsctl.h into smbfs_common and have the client code use
it (subsequent patch will change the server to use this
common version of the header).

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

23e91d8b 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: rename cifs_common to smbfs_common

As we move to common code between client and server, we have
been asked to make the names less confusing, and refer less
to "cifs" and more to words which include "smb" instead to
e.g. "smbfs" for the client (we already have "ksmbd" for the
kernel server, and "smbd" for the user space Samba daemon).
So to be more consistent in the naming of common code between
client and server and reduce the risk of merge conflicts as
more common code is added - rename "cifs_common" to
"smbfs_common" (in future releases we also will rename
the fs/cifs directory to fs/smbfs)

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

fc111fb9 08-Sep-2021 Steve French <stfrench@microsoft.com>

cifs: update FSCTL definitions

Add some missing defines used by ksmbd to the client
version of smbfsctl.h, and add a missing newer define
mentioned in the protocol definitions (MS-FSCC).

This will also make it easier to move to common code.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

3b33e3f4 08-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix silly logic error in io_task_work_match()

We check for the func with an OR condition, which means it always ends
up being false and we never match the task_work we want to cancel. In
the unexpected case that we do exit with that pending, we can trigger
a hang waiting for a worker to exit, but it was never created. syzbot
reports that as such:

INFO: task syz-executor687:8514 blocked for more than 143 seconds.
Not tainted 5.14.0-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor687 state:D stack:27296 pid: 8514 ppid: 8479 flags:0x00024004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0x940/0x26f0 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
io_wq_exit_workers fs/io-wq.c:1162 [inline]
io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197
io_uring_clean_tctx fs/io_uring.c:9607 [inline]
io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687
io_uring_files_cancel include/linux/io_uring.h:16 [inline]
do_exit+0x265/0x2a30 kernel/exit.c:780
do_group_exit+0x125/0x310 kernel/exit.c:922
get_signal+0x47f/0x2160 kernel/signal.c:2868
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x445cd9
RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c
RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154
R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000

While in there, also decrement accr->nr_workers. This isn't strictly
needed as we're exiting, but let's make sure the accounting matches up.

Fixes: 3146cba99aa2 ("io-wq: make worker creation resilient against signals")
Reported-by: syzbot+f62d3e0a4ea4f38f5326@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

009ad9f0 08-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: drop ctx->uring_lock before acquiring sqd->lock

The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock
for all the registration opcodes. We also hold a ref to the ctx, and we
do drop the lock for other reasons to quiesce, so it's fine to drop the
ctx lock temporarily to grab the sqd->lock. This fixes the following
lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.5/25433 is trying to acquire lock:
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792

but task is already holding lock:
ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ctx->uring_lock){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
__io_sq_thread fs/io_uring.c:7291 [inline]
io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 (&sqd->lock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3051 [inline]
check_prevs_add kernel/locking/lockdep.c:3174 [inline]
validate_chain kernel/locking/lockdep.c:3789 [inline]
__lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
__mutex_lock_common kernel/locking/mutex.c:596 [inline]
__mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
__io_uring_register fs/io_uring.c:10757 [inline]
__do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&ctx->uring_lock);
lock(&sqd->lock);
lock(&ctx->uring_lock);
lock(&sqd->lock);

*** DEADLOCK ***

Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3fc37253 08-Sep-2021 Dan Williams <dan.j.williams@intel.com>

Merge branch 'for-5.15/fsdax-cleanups' into for-5.15/libnvdimm

Include Christoph's rework of the dax_supported() helpers in the v5.15
libnvdimm update. This supports the ongoing dax-reflink enabling effort.


8a05abd0 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

- a set of patches to address fsync stalls caused by depending on
periodic rather than triggered MDS journal flushes in some cases
(Xiubo Li)

- a fix for mtime effectively not getting updated in case of competing
writers (Jeff Layton)

- a couple of fixes for inode reference leaks and various WARNs after
"umount -f" (Xiubo Li)

- a new ceph.auth_mds extended attribute (Jeff Layton)

- a smattering of fixups and cleanups from Jeff, Xiubo and Colin.

* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
ceph: fix dereference of null pointer cf
ceph: drop the mdsc_get_session/put_session dout messages
ceph: lockdep annotations for try_nonblocking_invalidate
ceph: don't WARN if we're forcibly removing the session caps
ceph: don't WARN if we're force umounting
ceph: remove the capsnaps when removing caps
ceph: request Fw caps before updating the mtime in ceph_write_iter
ceph: reconnect to the export targets on new mdsmaps
ceph: print more information when we can't find snaprealm
ceph: add ceph_change_snap_realm() helper
ceph: remove redundant initializations from mdsc and session
ceph: cancel delayed work instead of flushing on mdsc teardown
ceph: add a new vxattr to return auth mds for an inode
ceph: remove some defunct forward declarations
ceph: flush the mdlog before waiting on unsafe reqs
ceph: flush mdlog before umounting
ceph: make iterate_sessions a global symbol
ceph: make ceph_create_session_msg a global symbol
ceph: fix comment about short copies in ceph_write_end
ceph: fix memory leak on decode error in ceph_handle_caps


4cf0ccd0 06-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix control flow issues in sid_to_id()

Addresses-Coverity reported Control flow issues in sid_to_id()
/fs/ksmbd/smbacl.c: 277 in sid_to_id()
271
272 if (sidtype == SIDOWNER) {
273 kuid_t uid;
274 uid_t id;
275
276 id = le32_to_cpu(psid->sub_auth[psid->num_subauth - 1]);
>>> CID 1506810: Control flow issues (NO_EFFECT)
>>> This greater-than-or-equal-to-zero comparison of an unsigned value
>>> is always true. "id >= 0U".
277 if (id >= 0) {
278 /*
279 * Translate raw sid into kuid in the server's user
280 * namespace.
281 */
282 uid = make_kuid(&init_user_ns, id);

Addresses-Coverity: ("Control flow issues")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

4ffd5264 06-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: fix read of uninitialized variable ret in set_file_basic_info

Addresses-Coverity reported Uninitialized variables warninig :

/fs/ksmbd/smb2pdu.c: 5525 in set_file_basic_info()
5519 if (!rc) {
5520 inode->i_ctime = ctime;
5521 mark_inode_dirty(inode);
5522 }
5523 inode_unlock(inode);
5524 }
>>> CID 1506805: Uninitialized variables (UNINIT)
>>> Using uninitialized value "rc".
5525 return rc;
5526 }
5527
5528 static int set_file_allocation_info(struct ksmbd_work *work,
5529 struct ksmbd_file *fp, char *buf)
5530 {

Addresses-Coverity: ("Uninitialized variable")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

36bbeb33 06-Sep-2021 Colin Ian King <colin.king@canonical.com>

ksmbd: add missing assignments to ret on ndr_read_int64 read calls

Currently there are two ndr_read_int64 calls where ret is being checked
for failure but ret is not being assigned a return value from the call.
Static analyis is reporting the checks on ret as dead code. Fix this.

Addresses-Coverity: ("Logical dead code")
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

c57a91fb 08-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix missing mb() before waitqueue_active

In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a
memory barrier required by waitqueue_active(&ctx->poll_wait). There is
a wq_has_sleeper(), which does smb_mb() inside, but it's called only for
SQPOLL.

Fixes: 5fd4617840596 ("io_uring: be smarter about waking multiple CQ ring waiters")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d338201 08-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...


6fcac87e 07-Sep-2021 QiuXi <qiuxi1@huawei.com>

coredump: fix memleak in dump_vma_snapshot()

dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot()
returns -EFAULT, the memory will be leaked, so we free it correctly.

Link: https://lkml.kernel.org/r/20210810020441.62806-1-qiuxi1@huawei.com
Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Signed-off-by: QiuXi <qiuxi1@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbd9d6f8 07-Sep-2021 David Oberhollenzer <david.oberhollenzer@sigma-star.at>

fs/coredump.c: log if a core dump is aborted due to changed file permissions

For obvious security reasons, a core dump is aborted if the filesystem
cannot preserve ownership or permissions of the dump file.

This affects filesystems like e.g. vfat, but also something like a 9pfs
share in a Qemu test setup, running as a regular user, depending on the
security model used. In those cases, the result is an empty core file and
a confused user.

To hopefully save other people a lot of time figuring out the cause, this
patch adds a simple log message for those specific cases.

[akpm@linux-foundation.org: s/|%s/%s/ in printk text]

Link: https://lkml.kernel.org/r/20210701233151.102720-1-david.oberhollenzer@sigma-star.at
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

98e2e409 07-Sep-2021 Zhen Lei <thunder.leizhen@huawei.com>

nilfs2: use refcount_dec_and_lock() to fix potential UAF

When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().

Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.

Use refcount_dec_and_lock() to ensure that both the operations of
decrease refcount to 0 and link deletion are lock protected eliminates
this risk.

CPU0 CPU1
nilfs_put_root():
<-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);

kfree(root);
<-------- use-after-free

refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
kill_block_super+0x4a/0x90 fs/super.c:1446
deactivate_locked_super+0x6a/0xb0 fs/super.c:335
deactivate_super+0x85/0x90 fs/super.c:366
cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
__cleanup_mnt+0x15/0x20 fs/namespace.c:1125
task_work_run+0x8e/0x110 kernel/task_work.c:151
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There is no reproduction program, and the above is only theoretical
analysis.

Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

17243e1c 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group

kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del(). See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b2fe39c2 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group

If kobject_init_and_add returns with error, kobject_put() is needed here
to avoid memory leak, because kobject_init_and_add may return error
without freeing the memory associated with the kobject it allocated.

Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a3e18125 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group

The kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del. See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24f8cb1e 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group

If kobject_init_and_add return with error, kobject_put() is needed here to
avoid memory leak, because kobject_init_and_add may return error without
freeing the memory associated with the kobject it allocated.

Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dbc6e7d4 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix NULL pointer in nilfs_##name##_attr_release

In nilfs_##name##_attr_release, kobj->parent should not be referenced
because it is a NULL pointer. The release() method of kobject is always
called in kobject_put(kobj), in the implementation of kobject_put(), the
kobj->parent will be assigned as NULL before call the release() method.
So just use kobj to get the subgroups, which is more efficient and can fix
a NULL pointer reference problem.

Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5f5dec07 07-Sep-2021 Nanyong Sun <sunnanyong@huawei.com>

nilfs2: fix memory leak in nilfs_sysfs_create_device_group

Patch series "nilfs2: fix incorrect usage of kobject".

This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kboject in nilfs2
sysfs implementation.

This patch (of 6):

Reported by syzkaller:

BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
kstrdup+0x36/0x70 mm/util.c:60
kstrdup_const+0x35/0x60 mm/util.c:83
kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
kobject_add_varg lib/kobject.c:384 [inline]
kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
legacy_get_tree+0x105/0x210 fs/fs_context.c:592
vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0xf9b/0x1990 fs/namespace.c:3235
do_mount+0xea/0x100 fs/namespace.c:3248
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

If kobject_init_and_add return with error, then the cleanup of kobject
is needed because memory may be allocated in kobject_init_and_add
without freeing.

And the place of cleanup_dev_kobject should use kobject_put to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. And no more cleanup will
do after cleanup_dev_kobject, so kobject_put is needed here.

Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1e1c1583 07-Sep-2021 Nicholas Piggin <npiggin@gmail.com>

fs/epoll: use a per-cpu counter for user's watches count

This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net

Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c2f273eb 07-Sep-2021 Ohhoon Kwon <ohoono.kwon@samsung.com>

connector: send event on write to /proc/[pid]/comm

While comm change event via prctl has been reported to proc connector by
'commit f786ecba4158 ("connector: add comm change event report to proc
connector")', connector listeners were missing comm changes by explicit
writes on /proc/[pid]/comm.

Let explicit writes on /proc/[pid]/comm report to proc connector.

Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8d23b208 07-Sep-2021 Christoph Hellwig <hch@lst.de>

proc: stop using seq_get_buf in proc_task_name

Use seq_escape_str and seq_printf instead of poking holes into the
seq_file abstraction.

Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bb9c14ad 08-Sep-2021 David Hildenbrand <david@redhat.com>

hugetlbfs: s390 is always 64bit

No need to check for 64BIT. While at it, let's just select
ARCH_SUPPORTS_HUGETLBFS from arch/s390/Kconfig.

Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210908154506.20764-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>

713b9825 08-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io-wq: fix cancellation on create-worker failure

WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1151 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1146 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151
io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794
Modules linked in:
Call Trace:
tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158
task_work_run+0xe0/0x1a0 kernel/task_work.c:164
tracehook_notify_signal include/linux/tracehook.h:212 [inline]
handle_signal_work kernel/entry/common.c:146 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae

When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just
call io_run_cancel() to clean up the request, it's already enqueued via
io_wqe_insert_work() and will be executed either by some other worker
during cancellation (e.g. in io_wq_put_and_exit()).

Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ea47ab11 07-Sep-2021 Al Viro <viro@zeniv.linux.org.uk>

putname(): IS_ERR_OR_NULL() is wrong here

Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b4a4f213 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Standardize callers of filename_create()

filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.

Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

794ebcea 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Standardize callers of filename_lookup()

filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.

[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example. It's more that we want non-consuming
variant for close relative of that function...]

Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c5f563f9 07-Sep-2021 Al Viro <viro@zeniv.linux.org.uk>

rename __filename_parentat() to filename_parentat()

... in separate commit, to avoid noise in previous one

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0766ec82 01-Sep-2021 Stephen Brennan <stephen.s.brennan@oracle.com>

namei: Fix use after free in kern_path_locked

In 0ee50b47532a ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the filename before returning, and kern_path_locked() was migrated to
this calling convention. However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.

In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat(). Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().

Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper. Remove filename_parentat(),
now that we have no remaining callers.

Fixes: 0ee50b47532a ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

75b96f0e 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

- Allow mounting an active fuse device. Previously the fuse device
would always be mounted during initialization, and sharing a fuse
superblock was only possible through mount or namespace cloning

- Fix data flushing in syncfs (virtiofs only)

- Fix data flushing in copy_file_range()

- Fix a possible deadlock in atomic O_TRUNC

- Misc fixes and cleanups

* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: remove unused arg in fuse_write_file_get()
fuse: wait for writepages in syncfs
fuse: flush extending writes
fuse: truncate pagecache on atomic_o_trunc
fuse: allow sharing existing sb
fuse: move fget() to fuse_get_tree()
fuse: move option checking into fuse_fill_super()
fuse: name fs_context consistently
fuse: fix use after free in fuse_read_interrupt()


0bcfe68b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "memcg: enable accounting for pollfd and select bits arrays"

This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b.

Just like with the memcg lock accounting, the kernel test robot reports
a sizeable performance regression for this commit, and while it clearly
does the rigth thing in theory, we'll need to look at just how to avoid
or minimize the performance overhead of the memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

[ Note: the first link below is for this same commit but a different
commit ID, because it's the kernel test robot ended up noticing it in
Andrew Morton's patch queue ]

Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3754707b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "memcg: enable accounting for file lock caches"

This reverts commit 0f12156dff2862ac54235fc72703f18770769042.

The kernel test robot reports a sizeable performance regression for this
commit, and while it clearly does the rigth thing in theory, we'll need
to look at just how to avoid or minimize the performance overhead of the
memcg accounting.

People already have suggestions on how to do that, but it's "future
work".

So revert it for now.

Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cd1adf1b 07-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"

This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0dca4462 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: move fs/block_dev.c to block/bdev.c

Move it together with the rest of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd82cca7 07-Sep-2021 Christoph Hellwig <hch@lst.de>

block: split out operations on block special files

Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f79645df 06-Sep-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix double counting of split ordered extent

btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes.
Then, splitting an ordered extent will call btrfs_add_ordered_extent_*()
again for split extents, leading to double counting of the region of
a split extent. These leaked bytes are finally reported at unmount time
as follow:

BTRFS info (device dm-1): at unmount dio bytes count 364544

Fix the double counting by subtracting split extent's size from
fs_info->ordered_bytes.

Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c1247069 30-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: fix lockdep warning while mounting sprout fs

Following test case reproduces lockdep warning.

Test case:

$ mkfs.btrfs -f <dev1>
$ btrfstune -S 1 <dev1>
$ mount <dev1> <mnt>
$ btrfs device add <dev2> <mnt> -f
$ umount <mnt>
$ mount <dev2> <mnt>
$ umount <mnt>

The warning claims a possible ABBA deadlock between the threads
initiated by [#1] btrfs device add and [#0] the mount.

[ 540.743122] WARNING: possible circular locking dependency detected
[ 540.743129] 5.11.0-rc7+ #5 Not tainted
[ 540.743135] ------------------------------------------------------
[ 540.743142] mount/2515 is trying to acquire lock:
[ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.743458] but task is already holding lock:
[ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743541] which lock already depends on the new lock.
[ 540.743543] the existing dependency chain (in reverse order) is:

[ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
[ 540.743566] down_read_nested+0x48/0x2b0
[ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
[ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
[ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
[ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
[ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
[ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
[ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
[ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
[ 540.744166] __x64_sys_ioctl+0xe4/0x140
[ 540.744170] do_syscall_64+0x4b/0x80
[ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
[ 540.744184] __lock_acquire+0x155f/0x2360
[ 540.744188] lock_acquire+0x10b/0x5c0
[ 540.744190] __mutex_lock+0xb1/0xf80
[ 540.744193] mutex_lock_nested+0x27/0x30
[ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
[ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
[ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
[ 540.744472] legacy_get_tree+0x38/0x90
[ 540.744475] vfs_get_tree+0x30/0x140
[ 540.744478] fc_mount+0x16/0x60
[ 540.744482] vfs_kern_mount+0x91/0x100
[ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
[ 540.744536] legacy_get_tree+0x38/0x90
[ 540.744537] vfs_get_tree+0x30/0x140
[ 540.744539] path_mount+0x8d8/0x1070
[ 540.744541] do_mount+0x8d/0xc0
[ 540.744543] __x64_sys_mount+0x125/0x160
[ 540.744545] do_syscall_64+0x4b/0x80
[ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 540.744551] other info that might help us debug this:
[ 540.744552] Possible unsafe locking scenario:

[ 540.744553] CPU0 CPU1
[ 540.744554] ---- ----
[ 540.744555] lock(btrfs-chunk-00);
[ 540.744557] lock(&fs_devs->device_list_mutex);
[ 540.744560] lock(btrfs-chunk-00);
[ 540.744562] lock(&fs_devs->device_list_mutex);
[ 540.744564]
*** DEADLOCK ***

[ 540.744565] 3 locks held by mount/2515:
[ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
[ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
[ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.744708]
stack backtrace:
[ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5

But the device_list_mutex in clone_fs_devices() is redundant, as
explained below. Two threads [1] and [2] (below) could lead to
clone_fs_device().

[1]
open_ctree <== mount sprout fs
btrfs_read_chunk_tree()
mutex_lock(&uuid_mutex) <== global lock
read_one_dev()
open_seed_devices()
clone_fs_devices() <== seed fs_devices
mutex_lock(&orig->device_list_mutex) <== seed fs_devices

[2]
btrfs_init_new_device() <== sprouting
mutex_lock(&uuid_mutex); <== global lock
btrfs_prepare_sprout()
lockdep_assert_held(&uuid_mutex)
clone_fs_devices(seed_fs_device) <== seed fs_devices

Both of these threads hold uuid_mutex which is sufficient to protect
getting the seed device(s) freed while we are trying to clone it for
sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
can not free/write/replace because it is read-only. An unmounted seed
device can be freed by btrfs_free_stale_devices(), but it needs
uuid_mutex. So this patch removes the unnecessary device_list_mutex in
clone_fs_devices(). And adds a lockdep_assert_held(&uuid_mutex) in
clone_fs_devices().

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3fa421de 27-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: delay blkdev_put until after the device remove

When removing the device we call blkdev_put() on the device once we've
removed it, and because we have an EXCL open we need to take the
->open_mutex on the block device to clean it up. Unfortunately during
device remove we are holding the sb writers lock, which results in the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #407 Not tainted
------------------------------------------------------
losetup/11595 is trying to acquire lock:
ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_put+0x3a/0x220
btrfs_rm_device.cold+0x62/0xe5
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11595:
#0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc21255d4cb

So instead save the bdev and do the put once we've dropped the sb
writers lock in order to avoid the lockdep recursion.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8f96a5bf 27-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: update the bdev time directly when closing

We update the ctime/mtime of a block device when we remove it so that
blkid knows the device changed. However we do this by re-opening the
block device and calling filp_update_time. This is more correct because
it'll call the inode->i_op->update_time if it exists, but the block dev
inodes do not do this. Instead call generic_update_time() on the
bd_inode in order to avoid the blkdev_open path and get rid of the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #406 Not tainted
------------------------------------------------------
losetup/11596 is trying to acquire lock:
ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
file_open_name+0xc7/0x170
filp_open+0x2c/0x50
btrfs_scratch_superblocks.part.0+0x10f/0x170
btrfs_rm_device.cold+0xe8/0xed
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);

*** DEADLOCK ***

1 lock held by losetup/11596:
#0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cde7417c 30-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

btrfs: use correct header for div_u64 in misc.h

asm/do_div.h is for div_u64, but it is found in math64.h. This change
will make compiler job easier and prevent compiler errors in situation
where compiler will not find math64.h from another paths.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6f93e834 10-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: fix upper limit for max_inline for page size 64K

The mount option max_inline ranges from 0 to the sectorsize (which is
now equal to page size). But we parse the mount options too early and
before the actual sectorsize is read from the superblock. So the upper
limit of max_inline is unaware of the actual sectorsize and is limited
by the temporary sectorsize 4096, even on a system where the default
sectorsize is 64K.

Fix this by reading the superblock sectorsize before the mount option
parse.

Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

60f8fbaa 06-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.

In detail:

- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.

- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.

- Completion trace int -> unsigned -> long fix

- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL

- Fix regression with hash wait lock in this merge window

- Fix retry issued on block devices (Ming)

- Fix regression with links in this merge window (Pavel)

- Fix race with multi-shot poll and completions (Xiaoguang)

- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)

- Ensure submissions are flushed after running task_work (Pavel)"

* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker


a9667ac8 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: remove unused arg in fuse_write_file_get()

The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

660585b5 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: wait for writepages in syncfs

In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called. This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.

This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.

Introduce sync buckets, that hold a counter for the number of outstanding
write requests. On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.

It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets. Descendant buckets must
not complete until the parent completes. Add a count to the child (new)
bucket until the (parent) old bucket completes.

Use RCU protection to dereference the current bucket and to wake up an
emptied bucket. Use fc->lock to protect against parallel assignments to
the current bucket.

This leaves just the counter to be a possible scalability issue. The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

9c930054 31-Aug-2021 Xie Yongji <xieyongji@bytedance.com>

file: Export receive_fd() to modules

Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210831103634.33-4-xieyongji@bytedance.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

7a6b92d3 31-Aug-2021 Xie Yongji <xieyongji@bytedance.com>

eventfd: Export eventfd_wake_count to modules

Export eventfd_wake_count so that some modules can use
the eventfd_signal_count() to check whether the
eventfd_signal() call should be deferred to a safe context.

NB(mst): this patch is not needed in Linus tree since there
eventfd_signal_count() has been superseded by an already exported
eventfd_signal_allowed().

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210831103634.33-3-xieyongji@bytedance.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

0319b848 05-Sep-2021 Geert Uytterhoeven <geert@linux-m68k.org>

binfmt: a.out: Fix bogus semicolon

fs/binfmt_aout.c: In function ‘load_aout_library’:
fs/binfmt_aout.c:311:27: error: expected ‘)’ before ‘;’ token
311 | MAP_FIXED | MAP_PRIVATE;
| ^
fs/binfmt_aout.c:309:10: error: too few arguments to function ‘vm_mmap’
309 | error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
| ^~~~~~~
In file included from fs/binfmt_aout.c:12:
include/linux/mm.h:2626:35: note: declared here
2626 | extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
| ^~~~~~~

Fix this by reverting the accidental replacement of a comma by a
semicolon.

Fixes: 42be8b42535183f8 ("binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()")
Reported-by: noreply@ellerman.id.au
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cc883236 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: drop unnecessary journal handle in delalloc write

After we factor out the inline data write procedure from
ext4_da_write_end(), we don't need to start journal handle for the cases
of both buffer overwrite and append-write. If we need to update
i_disksize, mark_inode_dirty() do start handle and update inode buffer.
So we could just remove all the journal handle codes in the delalloc
write procedure.

After this patch, we could get a lot of performance improvement. Below
is the Unixbench comparison data test on my machine with 'Intel Xeon
Gold 5120' CPU and nvme SSD backend.

Test cmd:

./Run -c 56 -i 3 fstime fsbuffer fsdisk

Before this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1
File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0
======
System Benchmarks Index Score (Partial Only) 1186.6

After this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3
File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7
======
System Benchmarks Index Score (Partial Only) 2053.0

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com

6984aef5 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: factor out write end code of inline file

Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com

55ce2f64 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: correct the error path of ext4_write_inline_data_end()

Current error path of ext4_write_inline_data_end() is not correct.

Firstly, it should pass out the error value if ext4_get_inode_loc()
return fail, or else it could trigger infinite loop if we inject error
here. And then it's better to add inode to orphan list if it return fail
in ext4_journal_stop(), otherwise we could not restore inline xattr
entry after power failure. Finally, we need to reset the 'ret' value if
ext4_write_inline_data_end() return success in ext4_write_end() and
ext4_journalled_write_end(), otherwise we could not get the error return
value of ext4_journal_stop().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com

4df031ff 16-Jul-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: check and update i_disksize properly

After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
isize"), i_disksize could always be updated to i_size in ext4_setattr(),
and we could sure that i_disksize <= i_size since holding inode lock and
if i_disksize < i_size there are delalloc writes pending in the range
upto i_size. If the end of the current write is <= i_size, there's no
need to touch i_disksize since writeback will push i_disksize upto
i_size eventually. So we can switch to check i_size instead of
i_disksize in ext4_da_write_end() when write to the end of the file.
we also could remove ext4_mark_inode_dirty() together because we defer
inode dirtying to generic_write_end() or ext4_da_write_inline_data_end().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com

49624efa 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux

Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.

There are some (minor) user-visible changes:

- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().

- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).

- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"

* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()


f7464060 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://github.com/Paragon-Software-Group/linux-ntfs3

Merge NTFSv3 filesystem from Konstantin Komarov:
"This patch adds NTFS Read-Write driver to fs/ntfs3.

Having decades of expertise in commercial file systems development and
huge test coverage, we at Paragon Software GmbH want to make our
contribution to the Open Source Community by providing implementation
of NTFS Read-Write driver for the Linux Kernel.

This is fully functional NTFS Read-Write driver. Current version works
with NTFS (including v3.1) and normal/compressed/sparse files and
supports journal replaying.

We plan to support this version after the codebase once merged, and
add new features and fix bugs. For example, full journaling support
over JBD will be added in later updates"

Link: https://lore.kernel.org/lkml/20210729134943.778917-1-almaz.alexandrovich@paragon-software.com/
Link: https://lore.kernel.org/lkml/aa4aa155-b9b2-9099-b7a2-349d8d9d8fbd@paragon-software.com/

* git://github.com/Paragon-Software-Group/linux-ntfs3: (35 commits)
fs/ntfs3: Change how module init/info messages are displayed
fs/ntfs3: Remove GPL boilerplates from decompress lib files
fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Rework file operations
fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Fix error handling in indx_insert_into_root()
fs/ntfs3: Potential NULL dereference in hdr_find_split()
fs/ntfs3: Fix error code in indx_add_allocate()
fs/ntfs3: fix an error code in ntfs_get_acl_ex()
fs/ntfs3: add checks for allocation failure
fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
fs/ntfs3: Do not use driver own alloc wrappers
fs/ntfs3: Use kernel ALIGN macros over driver specific
fs/ntfs3: Restyle comment block in ni_parse_reparse()
fs/ntfs3: Remove unused including <linux/version.h>
fs/ntfs3: Fix fall-through warnings for Clang
fs/ntfs3: Fix one none utf8 char in source file
...


6abaa83c 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.

In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.

On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.

Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device

Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option

In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"

* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...


0961f0c0 04-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`

Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"

* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...


303fff2b 02-Sep-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: add validation for ndr read/write functions

If ndr->length is smaller than expected size, ksmbd can access invalid
access in ndr->data. This patch add validation to check ndr->offset is
over ndr->length. and added exception handling to check return value of
ndr read/write function.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

687c59e7 31-Aug-2021 Namjae Jeon <linkinjeon@kernel.org>

ksmbd: remove unused ksmbd_file_table_flush function

ksmbd_file_table_flush is a leftover from SMB1. This function is no longer
needed as SMB1 has been removed from ksmbd.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

72d6cbb5 29-Aug-2021 Hyunchul Lee <hyc.lee@gmail.com>

ksmbd: smbd: fix dma mapping error in smb_direct_post_send_data

Becase smb direct header is mapped and msg->num_sge
already is incremented, the decrement should be
removed from the condition.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

d475866e 29-Aug-2021 Per Forlin <perfn@axis.com>

ksmbd: Reduce error log 'speed is unknown' to debug

This log happens on servers with a network bridge since
the bridge does not have a specified link speed.
This is not a real error so change the error log to debug instead.

Signed-off-by: Per Forlin <perfn@axis.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

28a5d3de 26-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: defer notify_change() call

When ownership is changed we might in certain scenarios loose the
ability to alter the inode after we changed ownership. This can e.g.
happen when we are on an idmapped mount where uid 0 is mapped to uid
1000 and uid 1000 is mapped to uid 0.
A caller with fs*id 1000 will be able to create files as *id 1000 on
disk. They will also be able to change ownership of files owned by *id 0
to *id 1000 but they won't be able to change ownership in the other
direction. This means acl operations following notify_change() would
fail. Move the notify_change() call after the acls have been updated.
This guarantees that we don't end up with spurious "hash value diff"
warnings later on because we managed to change ownership but didn't
manage to alter acls.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

db7fb6fe 25-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: remove setattr preparations in set_file_basic_info()

Permission checking and copying over ownership information is the task
of the underlying filesystem not ksmbd. The order is also wrong here.
This modifies the inode before notify_change(). If notify_change() fails
this will have changed ownership nonetheless. All of this is unnecessary
though since the underlying filesystem's ->setattr handler will do all
this (if required) by itself.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

eb5784f0 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: ensure error is surfaced in set_file_basic_info()

It seems the error was accidently ignored until now. Make sure it is
surfaced.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

9467a0ce 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ndr: fix translation in ndr_encode_posix_acl()

The sid_to_id() helper encodes raw ownership information suitable for
s*id handling. This is conceptually equivalent to reporting ownership
information via stat to userspace. In this case the consumer is ksmbd
instead of a regular user. So when encoding raw ownership information
suitable for s*id handling later we need to map the id up according to
the user namespace of ksmbd itself taking any idmapped mounts into
account.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

55cd04d7 24-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in sid_to_id()

The sid_to_id() functions is relevant when changing ownership of
filesystem objects based on acl information. In this case we need to
first translate the relevant s*ids into k*ids in ksmbd's user namespace
and account for any idmapped mounts. Requesting a change in ownership
requires the inverse translation to be applied when we would report
ownership to userspace. So k*id_from_mnt() must be used here.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

f0bb29d5 24-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix subauth 0 handling in sid_to_id()

It's not obvious why subauth 0 would be excluded from translation. This
would lead to wrong results whenever a non-identity idmapping is used.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

0e844efe 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in acl entries

The ksmbd server performs translation of posix acls to smb acls.
Currently the translation is wrong since the idmapping of the mount is
used to map the ids into raw userspace ids but what is relevant is the
user namespace of ksmbd itself. The user namespace of ksmbd itself which
is the initial user namespace. The operation is similar to asking "What
*ids would a userspace process see given that k*id in the relevant user
namespace?". Before the final translation we need to apply the idmapping
of the mount in case any is used. Add two simple helpers for ksmbd.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

43205ca7 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in ksmbd_acls_fattr()

When creating new filesystem objects ksmbd translates between k*ids and
s*ids. For this it often uses struct smb_fattr and stashes the k*ids in
cf_uid and cf_gid. Let cf_uid and cf_gid always contain the final
information taking any potential idmapped mounts into account. When
finally translation cf_*id into s*ids translate them into the user
namespace of ksmbd since that is the relevant user namespace here.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

3cdc20e7 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in create_posix_rsp_buf()

When transferring ownership information to the client the k*ids are
translated into raw *ids before they are sent over the wire. The
function currently erroneously translates the k*ids according to the
mount's idmapping. Instead, reporting the owning *ids to userspace the
underlying k*ids need to be mapped up in the caller's user namespace.
This is how stat() works.
The caller in this instance is ksmbd itself and ksmbd always runs in the
initial user namespace. Translate according to that taking any potential
idmapped mounts into account.

Switch to from_k*id_munged() which ensures that the overflow*id is
returned instead of the (*id_t)-1 when the k*id can't be translated.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

475d6f98 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix translation in smb2_populate_readdir_entry()

When transferring ownership information to the
client the k*ids are translated into raw *ids before they are sent over
the wire. The function currently erroneously translates the k*ids
according to the mount's idmapping. Instead, reporting the owning *ids
to userspace the underlying k*ids need to be mapped up in the caller's
user namespace. This is how stat() works.
The caller in this instance is ksmbd itself and ksmbd always runs in the
initial user namespace. Translate according to that.

The idmapping of the mount is already taken into account by the lower
filesystem and so kstat->*id will contain the mapped k*ids.

Switch to from_k*id_munged() which ensures that the overflow*id is
returned instead of the (*id_t)-1 when the k*id can't be translated.

Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

da1e7ada 23-Aug-2021 Christian Brauner <christian.brauner@ubuntu.com>

ksmbd: fix lookup on idmapped mounts

It's great that the new in-kernel ksmbd server will support idmapped
mounts out of the box! However, lookup is currently broken. Lookup
helpers such as lookup_one_len() call inode_permission() internally to
ensure that the caller is privileged over the inode of the base dentry
they are trying to lookup under. So the permission checking here is
currently wrong.

Linux v5.15 will gain a new lookup helper lookup_one() that does take
idmappings into account. I've added it as part of my patch series to
make btrfs support idmapped mounts. The new helper is in linux-next as
part of David's (Sterba) btrfs for-next branch as commit
c972214c133b ("namei: add mapping aware lookup helper").

I've said it before during one of my first reviews: I would very much
recommend adding fstests to [1]. It already seems to have very
rudimentary cifs support. There is a completely generic idmapped mount
testsuite that supports idmapped mounts.

[1]: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

89c2b3b7 23-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: reexpand under-reexpanded iters

[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 74.219033] Call Trace:
[ 74.219683] dump_stack_lvl+0x8b/0xb3
[ 74.220706] print_address_description.constprop.0+0x1f/0x140
[ 74.224226] kasan_report.cold+0x7f/0x11b
[ 74.226085] iov_iter_revert+0x809/0x900
[ 74.227960] io_write+0x57d/0xe40
[ 74.232647] io_issue_sqe+0x4da/0x6a80
[ 74.242578] __io_queue_sqe+0x1ac/0xe60
[ 74.245358] io_submit_sqes+0x3f6e/0x76a0
[ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20
[ 74.257167] do_syscall_64+0x3b/0x90
[ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae

old_size = iov_iter_count();
...
iov_iter_revert(old_size - iov_iter_count());

If iov_iter_revert() is done base on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates
borders causing problems. This trace is due to no one reexpanding after
generic_write_checks().

Now iters store how many bytes has been truncated, so reexpand them to
the initial state right before reverting.

Cc: stable@vger.kernel.org
Reported-by: Palash Oswal <oswalpalash@gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b250e6d1 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.

- Show a warning if a different compiler is used for building external
modules.

- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.

- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.

- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.

- Add Nick Desaulniers as a Kbuild reviewer.

- Drop stale cc-option tests.

- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.

- Show a warning if 'FORCE' is missing for if_changed rules.

- Various cleanups

* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...


14726903 03-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"173 patches.

Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...


22e5fe2a 02-Sep-2021 Nadav Amit <namit@vmware.com>

userfaultfd: prevent concurrent API initialization

userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.

However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.

Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.

To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.

Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a759a909 02-Sep-2021 Nadav Amit <namit@vmware.com>

userfaultfd: change mmap_changing to atomic

Patch series "userfaultfd: minor bug fixes".

Three unrelated bug fixes. The first two addresses possible issues (not
too theoretical ones), but I did not encounter them in practice.

The third patch addresses a test bug that causes the test to fail on my
system. It has been sent before as part of a bigger RFC.

This patch (of 3):

mmap_changing is currently a boolean variable, which is set and cleared
without any lock that protects against concurrent modifications.

mmap_changing is supposed to mark whether userfaultfd page-faults handling
should be retried since mappings are undergoing a change. However,
concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
mmap_changing to be false, although the remove event was still not read
(hence acknowledged) by the user.

Change mmap_changing to atomic_t and increase/decrease appropriately. Add
a debug assertion to see whether mmap_changing is negative.

Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5b78ed24 02-Sep-2021 Luigi Rizzo <lrizzo@google.com>

mm/pagemap: add mmap_assert_locked() annotations to find_vma*()

find_vma() and variants need protection when used. This patch adds
mmap_assert_lock() calls in the functions.

To make sure the invariant is satisfied, we also need to add a
mmap_read_lock() around the get_user_pages_remote() call in
get_arg_page(). The lock is not strictly necessary because the mm has
been newly created, but the extra cost is limited because the same mutex
was also acquired shortly before in __bprm_mm_init(), so it is hot and
uncontended.

[penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs]
Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp

Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f358afc5 02-Sep-2021 Christoph Hellwig <hch@lst.de>

mm: remove flush_kernel_dcache_page

flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.

The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:

1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible

Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

30acd0bd 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for new namesapces and struct nsproxy

Container admin can create new namespaces and force kernel to allocate up
to several pages of memory for the namespaces and its associated
structures.

Net and uts namespaces have enabled accounting for such allocations. It
makes sense to account for rest ones to restrict the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

839d6820 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for fasync_cache

fasync_struct is used by almost all character device drivers to set up the
fasync queue, and for regular files by the file lease code. This
structure is quite small but long-living and it can be assigned for any
open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0f12156d 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for file lock caches

User can create file locks for each open file and force kernel to allocate
small but long-living objects per each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/b009f4c7-f0ab-c0ec-8e83-918f47d677da@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b6558434 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for pollfd and select bits arrays

User can call select/poll system calls with a large number of assigned
file descriptors and force kernel to allocate up to several pages of
memory till end of these sleeping system calls. We have here long-living
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/56e31cb5-6e1e-bdba-d7ca-be64b9842363@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

79f6540b 02-Sep-2021 Vasily Averin <vvs@virtuozzo.com>

memcg: enable accounting for mnt_cache entries

Patch series "memcg accounting from OpenVZ", v7.

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1. Now we're
rebasing again, revising our old patches and trying to push them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready
yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that they
are at least not worse as the according original RHEL7 kernel.

This patch (of 10):

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times. Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bb902cb4 02-Sep-2021 Yutian Yang <nglaive@gmail.com>

memcg: charge fs_context and legacy_fs_context

This patch adds accounting flags to fs_context and legacy_fs_context
allocation sites so that kernel could correctly charge these objects.

We have written a PoC to demonstrate the effect of the missing-charging
bugs. The PoC takes around 1,200MB unaccounted memory, while it is
charged for only 362MB memory usage. We evaluate the PoC on QEMU x86_64
v5.2.90 + Linux kernel v5.10.19 + Debian buster. All the limitations
including ulimits and sysctl variables are set as default. Specifically,
the hard NOFILE limit and nr_open in sysctl are both 1,048,576.

/*------------------------- POC code ----------------------------*/

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/file.h>
#include <time.h>
#include <sys/wait.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
#include <fcntl.h>
#include <linux/mount.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

#define STACK_SIZE (8 * 1024)
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
static inline int fsopen(const char *fs_name, unsigned int flags)
{
return syscall(__NR_fsopen, fs_name, flags);
}

static char thread_stack[512][STACK_SIZE];

int thread_fn(void* arg)
{
for (int i = 0; i< 800000; ++i) {
int fsfd = fsopen("nfs", FSOPEN_CLOEXEC);
if (fsfd == -1) {
errExit("fsopen");
}
}
while(1);
return 0;
}

int main(int argc, char *argv[]) {
int thread_pid;
for (int i = 0; i < 1; ++i) {
thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \
SIGCHLD, NULL);
}
while(1);
return 0;
}

/*-------------------------- end --------------------------------*/

Link: https://lkml.kernel.org/r/1626517201-24086-1-git-send-email-nglaive@gmail.com
Signed-off-by: Yutian Yang <nglaive@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <shenwenbo@zju.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

51cc3a66 02-Sep-2021 Hugh Dickins <hughd@google.com>

fs, mm: fix race in unlinking swapfile

We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off. That is not supposed to be possible.

2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.

may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.

Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.

Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
Fixes: f9454548e17c ("don't unlink an active swapfile")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9857a17f 02-Sep-2021 John Hubbard <jhubbard@nvidia.com>

mm/gup: remove try_get_page(), call try_get_compound_head() directly

try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7490a2d2 02-Sep-2021 Shakeel Butt <shakeelb@google.com>

writeback: memcg: simplify cgroup_writeback_by_id

Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
pages for a memcg. However mem_cgroup_wb_stats() does a lot more than
just get the number of dirty pages. Just directly get the number of dirty
pages instead of calling mem_cgroup_wb_stats(). Also
cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
remove the unused 'nr' parameter and no need to explicitly flush memcg
stats.

Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7ae12c80 02-Sep-2021 Johannes Weiner <hannes@cmpxchg.org>

fs: inode: count invalidated shadow pages in pginodesteal

pginodesteal is supposed to capture the impact that inode reclaim has on
the page cache state. Currently, it doesn't consider shadow pages that
get dropped this way, even though this can have a significant impact on
paging behavior, memory pressure calculations etc.

To improve visibility into these effects, make sure shadow pages get
counted when they get dropped through inode reclaim.

This changes the return value semantics of invalidate_mapping_pages()
semantics slightly, but the only two users are the inode shrinker itsel
and a usb driver that logs it for debugging purposes.

Link: https://lkml.kernel.org/r/20210614211904.14420-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

16e2df2a 02-Sep-2021 Johannes Weiner <hannes@cmpxchg.org>

fs: drop_caches: fix skipping over shadow cache inodes

When drop_caches truncates the page cache in an inode it also includes any
shadow entries for evicted pages. However, there is a preliminary check
on whether the inode has pages: if it has *only* shadow entries, it will
skip running truncation on the inode and leave it behind.

Fix the check to mapping_empty(), such that it runs truncation on any
inode that has cache entries at all.

Link: https://lkml.kernel.org/r/20210614211904.14420-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fee468fd 02-Sep-2021 Jan Kara <jack@suse.cz>

writeback: reliably update bandwidth estimation

Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback(). However neither of these
need to trigger when the system is relatively idle and writeback is
triggered e.g. from fsync(2). Make sure writeback estimates happen
reliably by triggering them from do_writepages().

Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

633a2abb 02-Sep-2021 Jan Kara <jack@suse.cz>

writeback: track number of inodes under writeback

Patch series "writeback: Fix bandwidth estimates", v4.

Fix estimate of writeback throughput when device is not fully busy doing
writeback. Michael Stapelberg has reported that such workload (e.g.
generated by linking) tends to push estimated throughput down to 0 and as
a result writeback on the device is practically stalled.

The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.

This patch (of 4):

Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput. In principle we could use number of pages under
writeback (WB_WRITEBACK counter) for this however normal percpu counter
reads are too inaccurate for our purposes and summing the counter is too
expensive.

Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9673e005 02-Sep-2021 Gang He <ghe@suse.com>

ocfs2: ocfs2_downconvert_lock failure results in deadlock

Usually, ocfs2_downconvert_lock() function always downconverts dlm lock to
the expected level for satisfy dlm bast requests from the other nodes.

But there is a rare situation. When dlm lock conversion is being
canceled, ocfs2_downconvert_lock() function will return -EBUSY. You need
to be aware that ocfs2_cancel_convert() function is asynchronous in fsdlm
implementation.

If we does not requeue this lockres entry, ocfs2 downconvert thread no
longer handles this dlm lock bast request. Then, the other nodes will not
get the dlm lock again, the current node's process will be blocked when
acquire this dlm lock again.

Link: https://lkml.kernel.org/r/20210830044621.12544-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6c85c2c7 02-Sep-2021 Tuo Li <islituo@gmail.com>

ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info()

A memory block is allocated through kmalloc(), and its return value is
assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not
initialized but it is accessed in:
iput(oinfo->dqi_gqinode);

To fix this possible uninitialized-variable access, assign NULL to
oinfo->dqi_gqinode, and add ocfs2_qinfo_lock_res_init() behind the
assignment in ocfs2_local_read_info(). Remove ocfs2_qinfo_lock_res_init()
in ocfs2_global_read_info().

Link: https://lkml.kernel.org/r/20210804031832.57154-1-islituo@gmail.com
Signed-off-by: Tuo Li <islituo@gmail.com>
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2f566394 02-Sep-2021 Dan Carpenter <dan.carpenter@oracle.com>

ocfs2: remove an unnecessary condition

The case where "tmp_oh" is NULL is handled at the start of the function.
At this point we know it's non-NULL so this will always return 1.

Link: https://lkml.kernel.org/r/YOcItgIXtisi3MaO@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Larry Chen <lchen@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8d0920bd 21-Apr-2021 David Hildenbrand <david@redhat.com>

mm: remove VM_DENYWRITE

All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

4589ff7c 23-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: remove in-tree usage of MAP_DENYWRITE

At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.

In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.

While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
"Overlayfs did not honor positive i_writecount on realfile for
VM_DENYWRITE mappings." [1]

[1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

fe69d560 23-Apr-2021 David Hildenbrand <david@redhat.com>

kernel/fork: always deny write access to current MM exe_file

We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().

Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.

Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

42be8b42 21-Apr-2021 David Hildenbrand <david@redhat.com>

binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()

uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().

For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.

With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.

This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

31efe48e 03-Sep-2021 Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix possible poll event lost in multi shot mode

IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode,
that means once one pure poll request returns one event(cqe), we'll
need to read or write continually until EAGAIN is returned, then I think
there is a possible poll event lost race in multi shot mode:

t1 poll request add | |
t2 | |
t3 event happens | |
t4 task work add | |
t5 | task work run |
t6 | commit one cqe |
t7 | | user app handles cqe
t8 | new event happen |
t9 | add back to waitqueue |
t10 |

After t6 but before t9, if new event happens, there'll be no wakeup
operation, and if user app has picked up this cqe in t7, read or write
until EAGAIN is returned. In t8, new event happens and will be lost,
though this race window maybe small.

To fix this possible race, add poll request back to waitqueue before
committing cqe.

Fixes: 88e41cf928a6 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210903142436.5767-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8d4ad41e 01-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: prolong tctx_task_work() with flushing

io_submit_flush_completions() may enqueue linked requests for task_work
execution, so don't leave tctx_task_work() right after the tw list is
exhausted, but try to flush and then retry.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

63637853 01-Sep-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't disable kiocb_done() CQE batching

Not passing issue_flags from kiocb_done() into __io_complete_rw() means
that completion batching for this case is disabled, e.g. for most of
buffered reads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa84693b 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL

SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.

Link: https://github.com/axboe/liburing/issues/420
Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05a444d3 29-Aug-2021 Colin Ian King <colin.king@canonical.com>

ceph: fix dereference of null pointer cf

Currently in the case where kmem_cache_alloc fails the null pointer
cf is dereferenced when assigning cf->is_capsnap = false. Fix this
by adding a null pointer check and return path.

Cc: stable@vger.kernel.org
Addresses-Coverity: ("Dereference null return")
Fixes: b2f9fa1f3bd8 ("ceph: correctly handle releasing an embedded cap flush")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a9c9a6f7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, qla2xxx,
target, smartpqi, lpfc, mpt3sas).

The core change causing the most churn was replacing the command
request field request with a macro, allowing us to offset map to it
and remove the redundant field; the same was also done for the tag
field.

The most impactful change is the final removal of scsi_ioctl, which
has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
scsi: ufs: ufs-exynos: Fix static checker warning
scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
scsi: lpfc: Copyright updates for 14.0.0.1 patches
scsi: lpfc: Update lpfc version to 14.0.0.1
scsi: lpfc: Add bsg support for retrieving adapter cmf data
scsi: lpfc: Add cmf_info sysfs entry
scsi: lpfc: Add debugfs support for cm framework buffers
scsi: lpfc: Add support for maintaining the cm statistics buffer
scsi: lpfc: Add rx monitoring statistics
scsi: lpfc: Add support for the CM framework
scsi: lpfc: Add cmfsync WQE support
scsi: lpfc: Add support for cm enablement buffer
scsi: lpfc: Add cm statistics buffer support
scsi: lpfc: Add EDC ELS support
scsi: lpfc: Expand FPIN and RDF receive logging
scsi: lpfc: Add MIB feature enablement support
scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
scsi: fc: Add EDC ELS definition
...


9f358999 26-Jan-2021 Jeff Layton <jlayton@kernel.org>

ceph: drop the mdsc_get_session/put_session dout messages

These are very chatty, racy, and not terribly useful. Just remove them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

3eaf5aa1 02-Sep-2021 Jeff Layton <jlayton@kernel.org>

ceph: lockdep annotations for try_nonblocking_invalidate

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a76d0a9c 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: don't WARN if we're forcibly removing the session caps

For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

42ad631b 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: don't WARN if we're force umounting

Force umount will try to close the sessions by setting the session
state to _CLOSING. We don't want to WARN in this situation, since it's
expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a6d37ccd 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: remove the capsnaps when removing caps

capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b11ed503 11-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: request Fw caps before updating the mtime in ceph_write_iter

The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.

This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.

Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.

URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d517b398 17-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: reconnect to the export targets on new mdsmaps

In the case where the export MDS has crashed just after the EImportStart
journal is flushed, a standby MDS takes over for it and when replaying
the EImportStart journal the MDS will wait the client to reconnect. That
may never happen because the client may not have registered or opened
the sessions yet.

When receiving a new map, ensure we reconnect to valid export targets as
well if their sessions don't exist yet.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

692e1715 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: print more information when we can't find snaprealm

Print a bit more information when we can't find the realm during
ceph_add_cap. Show both the inode number and the old realm inode
number.

Suggested-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0ba92e1c 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: add ceph_change_snap_realm() helper

Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c80dc3ae 27-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: remove redundant initializations from mdsc and session

The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so
there's no need to explicitly initialize either of their fields to 0.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b4002173 27-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: cancel delayed work instead of flushing on mdsc teardown

The first thing metric_delayed_work does is check mdsc->stopping,
and then return immediately if it's set. That's good since we would
have already torn down the metric structures at this point, otherwise,
but there is no locking around mdsc->stopping.

It's possible that the ceph_metric_destroy call could race with the
delayed_work, in which case we could end up with the delayed_work
accessing destroyed percpu variables.

At this point in the mdsc teardown, the "stopping" flag has already been
set, so there's no benefit to flushing the work. Move the work
cancellation in ceph_metric_destroy ahead of the percpu variable
destruction, and eliminate the flush_delayed_work call in
ceph_mdsc_destroy.

Fixes: 18f473b384a6 ("ceph: periodically send perf metrics to MDSes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

40e309de 26-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: add a new vxattr to return auth mds for an inode

Add a new vxattr that shows what MDS is authoritative for an inode (if
we happen to have auth caps). If we don't have an auth cap for the inode
then just return -1.

URL: https://tracker.ceph.com/issues/1276
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

49f8899e 06-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: remove some defunct forward declarations

We missed these in the recent fscache rework.

Reported-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

e1a4541e 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: flush the mdlog before waiting on unsafe reqs

For the client requests who will have unsafe and safe replies from
MDS daemons, in the MDS side the MDS daemons won't flush the mdlog
(journal log) immediatelly, because they think it's unnecessary.
That's true for most cases but not all, likes the fsync request.
The fsync will wait until all the unsafe replied requests to be
safely replied.

Normally if there have multiple threads or clients are running, the
whole mdlog in MDS daemons could be flushed in time if any request
will trigger the mdlog submit thread. So usually we won't experience
the normal operations will stuck for a long time. But in case there
has only one client with only thread is running, the stuck phenomenon
maybe obvious and the worst case it must wait at most 5 seconds to
wait the mdlog to be flushed by the MDS's tick thread periodically.

This patch will trigger to flush the mdlog in the relevant and auth
MDSes to which the in-flight requests are sent just before waiting
the unsafe requests to finish.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d095559c 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: flush mdlog before umounting

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

59b312f3 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: make iterate_sessions a global symbol

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

fba97e80 04-Jul-2021 Xiubo Li <xiubli@redhat.com>

ceph: make ceph_create_session_msg a global symbol

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ce3a8732 14-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix comment about short copies in ceph_write_end

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2ad32cf0 01-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix memory leak on decode error in ceph_handle_caps

If we hit a decoding error late in the frame, then we might exit the
function without putting the pool_ns string. Ensure that we always put
that reference on the way out of the function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c815f04b 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit updates from Shuah Khan:
"This KUnit update for Linux 5.15-rc1 adds new features and tests:

Tool:

- support for '--kernel_args' to allow setting module params

- support for '--raw_output' option to show just the kunit output
during make

Tests:

- new KUnit tests for checksums and timestamps

- Print test statistics on failure

- Integrates UBSAN into the KUnit testing framework. It fails KUnit
tests whenever it reports undefined behavior"

* tag 'linux-kselftest-kunit-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Print test statistics on failure
kunit: tool: make --raw_output support only showing kunit output
kunit: tool: add --kernel_args to allow setting module params
kunit: ubsan integration
fat: Add KUnit tests for checksums and timestamps


eceae1e7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:

- fix a race in configfs_lookup (Sishuai Gong)

- minor cleanups (me)

* tag 'configfs-5.15' of git://git.infradead.org/users/hch/configfs:
configfs: fix a race in configfs_lookup()
configfs: fold configfs_attach_attr into configfs_lookup
configfs: simplify the configfs_dirent_is_ready
configfs: return -ENAMETOOLONG earlier in configfs_lookup


265113f7 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
"This set includes a number of minor fixes and cleanups related to the
networking changes in the last release.

A patch to delay ack messages reduces network traffic significantly"

* tag 'dlm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: avoid comms shutdown delay in release_lockspace
fs: dlm: fix return -EINTR on recovery stopped
fs: dlm: implement delayed ack handling
fs: dlm: move receive loop into receive handler
fs: dlm: fix multiple empty writequeue alloc
fs: dlm: generic connect func
fs: dlm: auto load sctp module
fs: dlm: introduce generic listen
fs: dlm: move to static proto ops
fs: dlm: introduce con_next_wq helper
fs: dlm: cleanup and remove _send_rcom
fs: dlm: clear CF_APP_LIMITED on close
fs: dlm: fix typo in tlv prefix
fs: dlm: use READ_ONCE for config var
fs: dlm: use sk->sk_socket instead of con->sock


3146cba9 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: make worker creation resilient against signals

If a task is queueing async work and also handling signals, then we can
run into the case where create_io_thread() is interrupted and returns
failure because of that. If this happens for creating the first worker
in a group, then that worker will never get created and we can hang the
ring.

If we do get a fork failure, retry from task_work. With signals we have
to be a bit careful as we cannot simply queue as task_work, as we'll
still have signals pending at that point. Punt over a normal workqueue
first and then create from task_work after that.

Lastly, ensure that we handle fatal worker creations. Worker creation
failures are normally not fatal, only if we fail to create one in an empty
worker group can we not make progress. Right now that is ignored, ensure
that we handle that and run cancel on the work item.

There are two paths that create new workers - one is the "existing worker
going to sleep", and the other is "no workers found for this work, create
one". The former is never fatal, as workers do exist in the group. Only
the latter needs to be carefully handled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

05c5f4ee 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: get rid of FIXED worker flag

It makes the logic easier to follow if we just get rid of the fixed worker
flag, and simply ensure that we never exit the last worker in the group.
This also means that no particular worker is special.

Just track the last timeout state, and if we have hit it and no work
is pending, check if there are other workers. If yes, then we can exit
this one safely.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

b0cfcdd9 16-Jul-2021 Linus Torvalds <torvalds@linux-foundation.org>

d_path: make 'prepend()' fill up the buffer exactly on overflow

Instead of just marking the buffer as having overflowed, fill it up as
much as we can. That will allow the overflow case to then return
whatever truncated result if it wants to.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

111c1aa8 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"In addition to some ext4 bug fixes and cleanups, this cycle we add the
orphan_file feature, which eliminates bottlenecks when doing a large
number of parallel truncates and file deletions, and move the discard
operation out of the jbd2 commit thread when using the discard mount
option, to better support devices with slow discard operations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: make the updating inode data procedure atomic
ext4: remove an unnecessary if statement in __ext4_get_inode_loc()
ext4: move inode eio simulation behind io completeion
ext4: Improve scalability of ext4 orphan file handling
ext4: Orphan file documentation
ext4: Speedup ext4 orphan inode handling
ext4: Move orphan inode handling into a separate file
ext4: Support for checksumming from journal triggers
ext4: fix race writing to an inline_data file while its xattrs are changing
jbd2: add sparse annotations for add_transaction_credits()
ext4: fix sparse warnings
ext4: Make sure quota files are not grabbed accidentally
ext4: fix e2fsprogs checksum failure for mounted filesystem
ext4: if zeroout fails fall back to splitting the extent node
ext4: reduce arguments of ext4_fc_add_dentry_tlv
ext4: flush background discard kwork when retry allocation
ext4: get discard out of jbd2 commit kthread contex
ext4: remove the repeated comment of ext4_trim_all_free
ext4: add new helper interface ext4_try_to_trim_range()
ext4: remove the 'group' parameter of ext4_trim_extent
...


815409a1 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

- Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

- Improve performance by enabling RCU lookup.

- Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument. The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: enable RCU'd ->get_acl()
vfs: add rcu argument to ->get_acl() callback
ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
ovl: use kvalloc in xattr copy-up
ovl: update ctime when changing fileattr
ovl: skip checking lower file's i_writecount on truncate
ovl: relax lookup error on mismatch origin ftype
ovl: do not set overlay.opaque for new directories
ovl: add ovl_allow_offline_changes() helper
ovl: disable decoding null uuid with redirect_dir
ovl: consistent behavior for immutable/append-only inodes
ovl: copy up sync/noatime fileattr flags
ovl: pass ovl_fs to ovl_check_setxattr()
fs: add generic helper for filling statx attribute flags


412106c2 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
"In this cycle, direct I/O and fsdax support for uncompressed files are
now added in order to avoid double-caching for loop device and VM
container use cases. All uncompressed cases are now turned into iomap
infrastructure, which looks much simpler and cleaner.

In addition, fiemap support is added for both (un)compressed files by
using iomap infrastructure as well so end users can easily get file
distribution. We've also added chunk-based uncompressed files support
for data deduplication as the next step of VM container use cases.

Summary:

- support direct I/O for all uncompressed files

- support fsdax for non-tailpacking regular files

- use iomap infrastructure for all uncompressed cases

- support fiemap for both (un)compressed files

- introduce chunk-based files for chunk deduplication

- some cleanups"

* tag 'erofs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix double free of 'copied'
erofs: support reading chunk-based uncompressed files
erofs: introduce chunk-based file on-disk format
erofs: add fiemap support with iomap
erofs: add support for the full decompressed length
erofs: remove the mapping parameter from erofs_try_to_free_cached_page()
erofs: directly use wrapper erofs_page_is_managed() when shrinking
erofs: convert all uncompressed cases to iomap
erofs: dax support for non-tailpacking regular file
erofs: iomap support for non-tailpacking DIO


89594c74 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:
"Preparatory work for the fscache rewrite that's being worked on and
fix some bugs. These include:

- Always select netfs stats when enabling fscache stats since they're
displayed through the same procfile.

- Add a cookie debug ID that can be used in tracepoints instead of a
pointer and cache it in the netfs_cache_resources struct rather
than in the netfs_read_request struct to make it more available.

- Use file_inode() in cachefiles rather than dereferencing
file->f_inode directly.

- Provide a procfile to display fscache cookies.

- Remove the fscache and cachefiles histogram procfiles.

- Remove the fscache object list procfile.

- Avoid using %p in fscache and cachefiles as the value is hashed and
not comparable to the register dump in an oops trace.

- Fix the cookie hash function to actually achieve useful dispersion.

- Fix fscache_cookie_put() so that it doesn't dereference the cookie
pointer in the tracepoint after the refcount has been decremented
(we're only allowed to do that if we decremented it to zero).

- Use refcount_t rather than atomic_t for the fscache_cookie
refcount"

* tag 'fscache-next-20210829' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Use refcount_t for the cookie refcount instead of atomic_t
fscache: Fix fscache_cookie_put() to not deref after dec
fscache: Fix cookie key hashing
cachefiles: Change %p in format strings to something else
fscache: Change %p in format strings to something else
fscache: Remove the object list procfile
fscache, cachefiles: Remove the histogram stuff
fscache: Procfile to display cookies
fscache: Add a cookie debug ID and use that in traces
cachefiles: Use file_inode() rather than accessing ->f_inode
netfs: Move cookie debug ID to struct netfs_cache_resources
fscache: Select netfs stats if fscache stats are enabled


2e3a51b5 29-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Change how module init/info messages are displayed

Usually in file system init() messages are only displayed in info level.
Change level from notice to info, but keep CONFIG_NTFS3_64BIT_CLUSTER in
notice level. Also this need even more attention so let's put big
warning here so that nobody will not try accidentally use it.

There is also no good reason to display internal stuff like binary tree
search. This is always on option which can only disabled for debugging
purposes by developer. Also this message does not even check if
developer has disabled it or not so it is useless info.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

989e795b 26-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove GPL boilerplates from decompress lib files

Files already have SDPX identifier so no reason to keep boilerplates in
these files anymore.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

dd854e4b 25-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter

This check will be also performed in generic_file_read_iter() so we do
not want to check this two times in a row.

This was founded with Smatch
fs/ntfs3/file.c:803 ntfs_file_read_iter()
warn: unused return: count = iov_iter_count()

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

d4e8e135 25-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()

Use fiemap_prep() to check valid flags. It also shrink request scope
(@len) to what the fs can actually handle.

This address following Smatch static checker warning:
fs/ntfs3/frecord.c:1894 ni_fiemap()
warn: potential integer overflow from user 'vbo + len'

Because fiemap_prep() shrinks @len this cannot happened anymore.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: lore.kernel.org/ntfs3/20210825080440.GA17407@kili/

Fixes: 4342306f0f0d ("fs/ntfs3: Add file operations and implementation")
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

1fd95c05 02-Sep-2021 Theodore Ts'o <tytso@mit.edu>

ext4: add error checking to ext4_ext_replay_set_iblocks()

If the call to ext4_map_blocks() fails due to an corrupted file
system, ext4_ext_replay_set_iblocks() can get stuck in an infinite
loop. This could be reproduced by running generic/526 with a file
system that has inline_data and fast_commit enabled. The system will
repeatedly log to the console:

EXT4-fs warning (device dm-3): ext4_block_to_path:105: block 1074800922 > max in inode 131076

and the stack that it gets stuck in is:

ext4_block_to_path+0xe3/0x130
ext4_ind_map_blocks+0x93/0x690
ext4_map_blocks+0x100/0x660
skip_hole+0x47/0x70
ext4_ext_replay_set_iblocks+0x223/0x440
ext4_fc_replay_inode+0x29e/0x3b0
ext4_fc_replay+0x278/0x550
do_one_pass+0x646/0xc10
jbd2_journal_recover+0x14a/0x270
jbd2_journal_load+0xc4/0x150
ext4_load_journal+0x1f3/0x490
ext4_fill_super+0x22d4/0x2c00

With this patch, generic/526 still fails, but system is no longer
locking up in a tight loop. It's likely the root casue is that
fast_commit replay is corrupting file systems with inline_data, and we
probably need to add better error handling in the fast commit replay
code path beyond what is done here, which essentially just breaks the
infinite loop without reporting the to the higher levels of the code.

Fixes: 8016E29F4362 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Cc: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

90c90cda 02-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
"There's a lot in this cycle.

Starting with bug fixes: To avoid livelocks between the logging code
and the quota code, we've disabled the ability of quotaoff to turn off
quota accounting. (Admins can still disable quota enforcement, but
truly turning off accounting requires a remount.) We've tried to do
this in a careful enough way that there shouldn't be any user visible
effects aside from quotaoff no longer randomly hanging the system.

We've also fixed some bugs in runtime log behavior that could trip up
log recovery if (otherwise unrelated) transactions manage to start and
commit concurrently; some bugs in the GETFSMAP ioctl where we would
incorrectly restrict the range of records output if the two xfs
devices are of different sizes; a bug that resulted in fallocate
funshare failing unnecessarily; and broken behavior in the xfs inode
cache when DONTCACHE is in play.

As for new features: we now batch inode inactivations in percpu
background threads, which sharply decreases frontend thread wait time
when performing file deletions and should improve overall directory
tree deletion times. This eliminates both the problem where closing an
unlinked file (especially on a frozen fs) can stall for a long time,
and should also ease complaints about direct reclaim bogging down on
unlinked file cleanup.

Starting with this release, we've enabled pipelining of the XFS log.
On workloads with high rates of metadata updates to different shards
of the filesystem, multiple threads can be used to format committed
log updates into log checkpoints.

Lastly, with this release, two new features have graduated to
supported status: inode btree counters (for faster mounts), and
support for dates beyond Y2038. Expect these to be enabled by default
in a future release of xfsprogs.

Summary:

- Fix a potential log livelock on busy filesystems when there's so
much work going on that we can't finish a quotaoff before filling
up the log by removing the ability to disable quota accounting.

- Introduce the ability to use per-CPU data structures in XFS so that
we can do a better job of maintaining CPU locality for certain
operations.

- Defer inode inactivation work to per-CPU lists, which will help us
batch that processing. Deletions of large sparse files will
*appear* to run faster, but all that means is that we've moved the
work to the backend.

- Drop the EXPERIMENTAL warnings from the y2038+ support and the
inode btree counters, since it's been nearly a year and no
complaints have come in.

- Remove more of our bespoke kmem* variants in favor of using the
standard Linux calls.

- Prepare for the addition of log incompat features in upcoming
cycles by actually adding code to support this.

- Small cleanups of the xattr code in preparation for landing support
for full logging of extended attribute updates in a future cycle.

- Replace the various log shutdown state and flag code all over xfs
with a single atomic bit flag.

- Fix a serious log recovery bug where log item replay can be skipped
based on the start lsn of a transaction even though the transaction
commit lsn is the key data point for that by enforcing start lsns
to appear in the log in the same order as commit lsns.

- Enable pipelining in the code that pushes log items to disk.

- Drop ->writepage.

- Fix some bugs in GETFSMAP where the last fsmap record reported for
a device could extend beyond the end of the device, and a separate
bug where query keys for one device could be applied to another.

- Don't let GETFSMAP query functions edit their input parameters.

- Small cleanups to the scrub code's handling of perag structures.

- Small cleanups to the incore inode tree walk code.

- Constify btree function parameters that aren't changed, so that
there will never again be confusion about range query functions
changing their input parameters.

- Standardize the format and names of tracepoint data attributes.

- Clean up all the mount state and feature flags to use wrapped
bitset functions instead of inconsistently open-coded flag checks.

- Fix some confusion between xfs_buf hash table key variable vs.
block number.

- Fix a mis-interaction with iomap where we reported shared delalloc
cow fork extents to iomap, which would cause the iomap unshare
operation to return IO errors unnecessarily.

- Fix DONTCACHE behavior"

* tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (103 commits)
xfs: fix I_DONTCACHE
xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
xfs: fix perag structure refcounting error when scrub fails
xfs: rename buffer cache index variable b_bn
xfs: convert bp->b_bn references to xfs_buf_daddr()
xfs: introduce xfs_buf_daddr()
xfs: kill xfs_sb_version_has_v3inode()
xfs: introduce xfs_sb_is_v5 helper
xfs: remove unused xfs_sb_version_has wrappers
xfs: convert xfs_sb_version_has checks to use mount features
xfs: convert scrub to use mount-based feature checks
xfs: open code sb verifier feature checks
xfs: convert xfs_fs_geometry to use mount feature checks
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown
xfs: convert remaining mount flags to state flags
xfs: convert mount flags to features
xfs: consolidate mount option features in m_features
xfs: replace xfs_sb_version checks with feature flag checks
xfs: reflect sb features in xfs_mount
xfs: rework attr2 feature and mount options
...


bcfeebbf 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull exit cleanups from Eric Biederman:
"In preparation of doing something about PTRACE_EVENT_EXIT I have
started cleaning up various pieces of code related to do_exit. Most of
that code I did not manage to get tested and reviewed before the merge
window opened but a handful of very useful cleanups are ready to be
merged.

The first change is simply the removal of the bdflush system call. The
code has now been disabled long enough that even the oldest userspace
working userspace setups anyone can find to test are fine with the
bdflush system call being removed.

Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
calling do_exit directly is interesting only in that it is nearly the
most difficult of the incorrect uses of do_exit to remove.

The change to the seccomp code to simply send a signal instead of
calling do_coredump directly is a very nice little cleanup made
possible by realizing the existing signal sending helpers were missing
a little bit of functionality that is easy to provide"

* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal/seccomp: Dump core when there is only one live thread
signal/seccomp: Refactor seccomp signal and coredump generation
signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
exit/bdflush: Remove the deprecated bdflush system call


48983701 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull siginfo si_trapno updates from Eric Biederman:
"The full set of si_trapno changes was not appropriate as a fix for the
newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
related cleanups.

This is the rest of the cleanups for si_trapno that reduces it from
being a really weird arch special case that is expect to be always
present (but isn't) on the architectures that support it to being yet
another field in the _sigfault union of struct siginfo.

The changes have been reviewed and marinated in linux-next. With the
removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
that works across architectures should be easier to write and
maintain"

* 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
signal: Verify the alignment and size of siginfo_t
signal: Remove the generic __ARCH_SI_TRAPNO support
signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
arm64: Add compile-time asserts for siginfo_t offsets
arm: Add compile-time asserts for siginfo_t offsets
sparc64: Add compile-time asserts for siginfo_t offsets


15e20db2 01-Sep-2021 Jens Axboe <axboe@kernel.dk>

io-wq: only exit on fatal signals

If the application uses io_uring and also relies heavily on signals
for communication, that can cause io-wq workers to spuriously exit
just because the parent has a signal pending. Just ignore signals
unless they are fatal.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

f95dc207 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: split bounded and unbounded work into separate lists

We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48c13d ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecd95673 26-Aug-2021 Alexander Aring <aahringo@redhat.com>

fs: dlm: avoid comms shutdown delay in release_lockspace

When dlm_release_lockspace does active shutdown on connections to
other nodes, the active shutdown will wait for any exisitng passive
shutdowns to be resolved. But, the sequence of operations during
dlm_release_lockspace can prevent the normal resolution of passive
shutdowns (processed normally by way of lockspace recovery.)
This disruption of passive shutdown handling can cause the active
shutdown to wait for a full timeout period, delaying the completion
of dlm_release_lockspace.

To fix this, make dlm_release_lockspace resolve existing passive
shutdowns (by calling dlm_clear_members earlier), before it does
active shutdowns. The active shutdowns will not find any passive
shutdowns to wait for, and will not be delayed.

Reported-by: Chris Mackowski <cmackows@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>

c6c3c570 01-Sep-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
"Here is the big set of driver core patches for 5.15-rc1.

These do change a number of different things across different
subsystems, and because of that, there were 2 stable tags created that
might have already come into your tree from different pulls that did
the following

- changed the bus remove callback to return void

- sysfs iomem_get_mapping rework

Other than those two things, there's only a few small things in here:

- kernfs performance improvements for huge numbers of sysfs users at
once

- tiny api cleanups

- other minor changes

All of these have been in linux-next for a while with no reported
problems, other than the before-mentioned merge issue"

* tag 'driver-core-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (33 commits)
MAINTAINERS: Add dri-devel for component.[hc]
driver core: platform: Remove platform_device_add_properties()
ARM: tegra: paz00: Handle device properties with software node API
bitmap: extend comment to bitmap_print_bitmask/list_to_buf
drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI
topology: use bin_attribute to break the size limitation of cpumap ABI
lib: test_bitmap: add bitmap_print_bitmask/list_to_buf test cases
cpumask: introduce cpumap_print_list/bitmask_to_buf to support large bitmask and list
sysfs: Rename struct bin_attribute member to f_mapping
sysfs: Invoke iomem_get_mapping() from the sysfs open callback
debugfs: Return error during {full/open}_proxy_open() on rmmod
zorro: Drop useless (and hardly used) .driver member in struct zorro_dev
zorro: Simplify remove callback
sh: superhyway: Simplify check in remove callback
nubus: Simplify check in remove callback
nubus: Make struct nubus_driver::remove return void
kernfs: dont call d_splice_alias() under kernfs node lock
kernfs: use i_lock to protect concurrent inode updates
kernfs: switch kernfs to use an rwsem
kernfs: use VFS negative dentry caching
...


9605f75c 30-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: should put a page beyond EOF when preparing a write

The prepare_compress_overwrite() gets/locks a page to prepare a read, and calls
f2fs_read_multi_pages() which checks EOF first. If there's any page beyond EOF,
we unlock the page and set cc->rpages[i] = NULL, which we can't put the page
anymore. This makes page leak, so let's fix by putting that page.

Fixes: a949dc5f2c5c ("f2fs: compress: fix race condition of overwrite vs truncate")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

827f0284 30-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: deallocate compressed pages when error happens

In f2fs_write_multi_pages(), f2fs_compress_pages() allocates pages for
compression work in cc->cpages[]. Then, f2fs_write_compressed_pages() initiates
bio submission. But, if there's any error before submitting the IOs like early
f2fs_cp_error(), previously it didn't free cpages by f2fs_compress_free_page().
Let's fix memory leak by putting that just before deallocating cc->cpages.

Fixes: 4c8ff7095bef ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

0242f642 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix queue stalling race

We need to set the stalled bit early, before we drop the lock for adding
us to the stall hash queue. If not, then we can race with new work being
queued between adding us to the stall hash and io_worker_handle_work()
marking us stalled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

927bc120 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull close_range() cleanup from Christian Brauner:
"This is a cleanup for close_range() which was sent as part of a bugfix
we did some time ago in commit 9b5b872215fe ("file: fix close_range()
for unshare+cloexec").

We used to share more code between some helpers for close_range()
which made retrieving the maximum number of open fds before calling
into the helpers sensible. But with the introduction of
CLOSE_RANGE_CLOEXEC and the need to retrieve the number of maximum fds
once more for CLOSE_RANGE_CLOEXEC that stopped making sense. So the
code was in a dumb in-limbo state.

Fix this by simplifying the code a bit.

The original idea was to only fix the bug itself and make backporting
easy. And since the cleanup wasn't very pressing I left it in
linux-next for a very long time. I didn't pull the patches from the
list again back then which is why they don't have lore-links. So I'm
listing them below explicitly"

Commit 03ba0fe4d09f ("file: simplify logic in __close_range()")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-3-brauner@kernel.org

Commit f49fd6d3c070 ("file: let pick_file() tell caller it's done")
Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-4-brauner@kernel.org

* tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
file: simplify logic in __close_range()
file: let pick_file() tell caller it's done


1dd5915a 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs.move_mount.move_mount_set_group.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull move_mount updates from Christian Brauner:
"This contains an extension to the move_mount() syscall making it
possible to add a single private mount into an existing propagation
tree.

The use-case comes from the criu folks which have been struggling with
restoring complex mount trees for a long time. Variations of this work
have been discussed at Plumbers before, e.g.

https://www.linuxplumbersconf.org/event/7/contributions/640/

The extension to move_mount() enables criu to restore any set of mount
namespaces, mount trees and sharing group trees without introducing
yet more complexity into mount propagation itself.

The changes required to criu to make use of this and restore complex
propagation trees are available at

https://github.com/Snorch/criu/commits/mount-v2-poc

A cleaned-up version of this will go up for merging into the main criu
repo after this lands"

* tag 'fs.move_mount.move_mount_set_group.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest
move_mount: allow to add a mount into an existing group


0ee7c3e2 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"The most notable externally visible change for this cycle is the
addition of support for reads to inline tail fragments of files, which
was requested by the erofs developers; and a correction for a kernel
memory corruption bug if the sysadmin tries to activate a swapfile
with more pages than the swapfile header suggests.

We also now report writeback completion errors to the file mapping
correctly, instead of munging all errors into EIO.

Internally, the bulk of the changes are Christoph's patchset to reduce
the indirect function call count by a third to a half by converting
iomap iteration from a loop pattern to a generator/consumer pattern.
As an added bonus, fsdax no longer open-codes iomap apply loops.

Summary:

- Simplify the bio_end_page usage in the buffered IO code.

- Support reading inline data at nonzero offsets for erofs.

- Fix some typos and bad grammar.

- Convert kmap_atomic usage in the inline data read path.

- Add some extra inline data input checking.

- Fix a memory corruption bug stemming from iomap_swapfile_activate
trying to activate more pages than mm was expecting.

- Pass errnos through the page writeback code so that writeback
errors are reported correctly instead of being munged to EIO.

- Replace iomap_apply with a open-coded iterator loops to reduce the
number of indirect calls by a third to a half.

- Refactor the fsdax code to use iomap iterators instead of the
open-coded iomap_apply code that it had before.

- Format file range iomap tracepoint data in hexadecimal and
standardize the names used in the pretty-print string"

* tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
iomap: standardize tracepoint formatting and storage
mm/swap: consider max pages in iomap_swapfile_add_extent
iomap: move loop control code to iter.c
iomap: constify iomap_iter_srcmap
fsdax: switch the fault handlers to use iomap_iter
fsdax: factor out a dax_fault_actor() helper
fsdax: factor out helpers to simplify the dax fault code
iomap: rework unshare flag
iomap: pass an iomap_iter to various buffered I/O helpers
iomap: remove iomap_apply
fsdax: switch dax_iomap_rw to use iomap_iter
iomap: switch iomap_swapfile_activate to use iomap_iter
iomap: switch iomap_seek_data to use iomap_iter
iomap: switch iomap_seek_hole to use iomap_iter
iomap: switch iomap_bmap to use iomap_iter
iomap: switch iomap_fiemap to use iomap_iter
iomap: switch __iomap_dio_rw to use iomap_iter
iomap: switch iomap_page_mkwrite to use iomap_iter
iomap: switch iomap_zero_range to use iomap_iter
iomap: switch iomap_file_unshare to use iomap_iter
...


916d636e 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'vfs-5.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull project quota update from Darrick Wong:
"A single VFS patch that prevents userspace from setting project quota
ids on files that the VFS considers invalid"

* tag 'vfs-5.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: forbid invalid project ID


8bda9557 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
"New features:

- Support for server-side disconnect injection via debugfs

- Protocol definitions for new RPC_AUTH_TLS authentication flavor

Performance improvements:

- Reduce page allocator traffic in the NFSD splice read actor

- Reduce CPU utilization in svcrdma's Send completion handler

Notable bug fixes:

- Stabilize lockd operation when re-exporting NFS mounts

- Fix the use of %.*s in NFSD tracepoints

- Fix /proc/sys/fs/nfs/nsm_use_hostnames"

* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...


b8ce1b9d 31-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't submit half-prepared drain request

[ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
[ 3784.910926] Call Trace:
[ 3784.910928] ? io_read+0x17c/0x480
[ 3784.910945] io_issue_sqe+0xcb/0x1840
[ 3784.910953] __io_queue_sqe+0x44/0x300
[ 3784.910959] io_req_task_submit+0x27/0x70
[ 3784.910962] tctx_task_work+0xeb/0x1d0
[ 3784.910966] task_work_run+0x61/0xa0
[ 3784.910968] io_run_task_work_sig+0x53/0xa0
[ 3784.910975] __x64_sys_io_uring_enter+0x22/0x30
[ 3784.910977] do_syscall_64+0x3d/0x90
[ 3784.910981] entry_SYSCALL_64_after_hwframe+0x44/0xae

io_drain_req() goes before checks for REQ_F_FAIL, which protect us from
submitting under-prepared request (e.g. failed in io_init_req(). Fail
such drained requests as well.

Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c6d3d9cb 31-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix queueing half-created requests

[ 27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
[ 27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[ 27.263730] RIP: 0010:sock_from_file+0x20/0x90
[ 27.272444] Call Trace:
[ 27.272736] io_sendmsg+0x98/0x600
[ 27.279216] io_issue_sqe+0x498/0x68d0
[ 27.281142] __io_queue_sqe+0xab/0xb50
[ 27.285830] io_req_task_submit+0xbf/0x1b0
[ 27.286306] tctx_task_work+0x178/0xad0
[ 27.288211] task_work_run+0xe2/0x190
[ 27.288571] exit_to_user_mode_prepare+0x1a1/0x1b0
[ 27.289041] syscall_exit_to_user_mode+0x19/0x50
[ 27.289521] do_syscall_64+0x48/0x90
[ 27.289871] entry_SYSCALL_64_after_hwframe+0x44/0xae

io_req_complete_failed() -> io_req_complete_post() ->
io_req_task_queue() still would try to enqueue hard linked request,
which can be half prepared (e.g. failed init), so we can't allow
that to happen.

Fixes: a8295b982c46d ("io_uring: fix failed linkchain code logic")
Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

08bdbd39 31-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: ensure that hash wait lock is IRQ disabling

A previous commit removed the IRQ safety of the worker and wqe locks,
but that left one spot of the hash wait lock now being done without
already having IRQs disabled.

Ensure that we use the right locking variant for the hashed waitqueue
lock.

Fixes: a9a4aa9fbfc5 ("io-wq: wqe and worker locks no longer need to be IRQ safe")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7db30437 21-Aug-2021 Ming Lei <ming.lei@redhat.com>

io_uring: retry in case of short read on block device

In case of buffered reading from block device, when short read happens,
we should retry to read more, otherwise the IO will be completed
partially, for example, the following fio expects to read 2MB, but it
can only read 1M or less bytes:

fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting

Fix the issue by allowing short read retry for block device, which sets
FMODE_BUF_RASYNC really.

Fixes: 9a173346bd9e ("io_uring: fix short read retries for non-reg files")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b3188e7 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: IORING_OP_WRITE needs hash_reg_file set

During some testing, it became evident that using IORING_OP_WRITE doesn't
hash buffered writes like the other writes commands do. That's simply
an oversight, and can cause performance regressions when doing buffered
writes with this command.

Correct that and add the flag, so that buffered writes are correctly
hashed when using the non-iovec based write command.

Cc: stable@vger.kernel.org
Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94ffb0a2 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix race between adding work and activating a free worker

The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.

Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and elimiate the race
with the worker itself checking inside the lock, but sleeping outside of
it.

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4529fb15 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'gfs2-v5.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

- Various withdraw related fixes (freeze glock recursion, thread
initialization / destruction order, journal recovery, glock cleanup,
withdraw under journal lock).

- Some error message improvements.

- Various minor cleanups.

* tag 'gfs2-v5.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Remove redundant check from gfs2_glock_dq
gfs2: Delay withdraw from atomic context
gfs2: Don't call dlm after protocol is unmounted
gfs2: don't stop reads while withdraw in progress
gfs2: Mark journal inodes as "don't cache"
gfs2: nit: gfs2_drop_inode shouldn't return bool
gfs2: Eliminate vestigial HIF_FIRST
gfs2: Make recovery error more readable
gfs2: Don't release and reacquire local statfs bh
gfs2: init system threads before freeze lock
gfs2: tiny cleanup in gfs2_log_reserve
gfs2: trivial clean up of gfs2_ail_error
gfs2: be more verbose replaying invalid rgrp blocks
gfs2: Fix glock recursion in freeze_go_xmote_bh
gfs2: Fix memory leak of object lsi on error return path


cd358208 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
"Some small fixes and cleanups for fs/crypto/:

- Fix ->getattr() for ext4, f2fs, and ubifs to report the correct
st_size for encrypted symlinks

- Use base64url instead of a custom Base64 variant

- Document struct fscrypt_operations"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: document struct fscrypt_operations
fscrypt: align Base64 encoding with RFC 4648 base64url
fscrypt: remove mention of symlink st_size quirk from documentation
ubifs: report correct st_size for encrypted symlinks
f2fs: report correct st_size for encrypted symlinks
ext4: report correct st_size for encrypted symlinks
fscrypt: add fscrypt_symlink_getattr() for computing st_size


87045e65 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.

There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.

Features:

- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity

- idmapped mount support

- make mount with rescue=ibadroots more tolerant to partially damaged
trees

- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles

- zoned mode block group auto reclaim can be disabled via sysfs knob

Performance improvements:

- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)

- batching of delayed items can speed up creating many files

- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)

Fixes:

- various zoned mode fixes

- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems

Core:

- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount

- improved flushing logic, fixing early ENOSPC on some workloads

- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity

- new tree items for fs-verity
- descriptor item
- Merkle tree item

- inode operations extended to be namespace-aware

- cleanups and refactoring

Generic code changes:

- fs: new export filemap_fdatawrite_wbc

- fs: removed sync_inode

- block: bio_trim argument type fixups

- vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...


9c849ce8 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client updates from Steve French:
"Eleven cifs/smb3 client fixes:

- mostly restructuring to allow disabling less secure algorithms
(this will allow eventual removing rc4 and md4 from general use in
the kernel)

- four fixes, including two for stable

- enable r/w support with fscache and cifs.ko

I am working on a larger set of changes (the usual ... multichannel,
auth and signing improvements), but wanted to get these in earlier to
reduce chance of merge conflicts later in the merge window"

* tag '5.15-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED
cifs: add cifs_common directory to MAINTAINERS file
cifs: cifs_md4 convert to SPDX identifier
cifs: create a MD4 module and switch cifs.ko to use it
cifs: fork arc4 and create a separate module for it for cifs and other users
cifs: remove support for NTLM and weaker authentication algorithms
cifs: enable fscache usage even for files opened as rw
oid_registry: Add OIDs for missing Spnego auth mechanisms to Macs
smb3: fix posix extensions mount option
cifs: fix wrong release in sess_alloc_buffer() failed path
CIFS: Fix a potencially linear read overflow


e24c567b 31-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag '5.15-rc-first-ksmbd-merge' of git://git.samba.org/ksmbd

Pull initial ksmbd implementation from Steve French:
"Initial merge of kernel smb3 file server, ksmbd.

The SMB family of protocols is the most widely deployed network
filesystem protocol, the default on Windows and Macs (and even on many
phones and tablets), with clients and servers on all major operating
systems, but lacked a kernel server for Linux. For many cases the
current userspace server choices were suboptimal either due to memory
footprint, performance or difficulty integrating well with advanced
Linux features.

ksmbd is a new kernel module which implements the server-side of the
SMB3 protocol. The target is to provide optimized performance, GPLv2
SMB server, and better lease handling (distributed caching). The
bigger goal is to add new features more rapidly (e.g. RDMA aka
"smbdirect", and recent encryption and signing improvements to the
protocol) which are easier to develop on a smaller, more tightly
optimized kernel server than for example in Samba.

The Samba project is much broader in scope (tools, security services,
LDAP, Active Directory Domain Controller, and a cross platform file
server for a wider variety of purposes) but the user space file server
portion of Samba has proved hard to optimize for some Linux workloads,
including for smaller devices.

This is not meant to replace Samba, but rather be an extension to
allow better optimizing for Linux, and will continue to integrate well
with Samba user space tools and libraries where appropriate. Working
with the Samba team we have already made sure that the configuration
files and xattrs are in a compatible format between the kernel and
user space server.

Various types of functional and regression tests are regularly run
against it. One example is the automated 'buildbot' regression tests
which use the Linux client to test against ksmbd, e.g.

http://smb3-test-rhel-75.southcentralus.cloudapp.azure.com/#/builders/8/builds/56

but other test suites, including Samba's smbtorture functional test
suite are also used regularly"

* tag '5.15-rc-first-ksmbd-merge' of git://git.samba.org/ksmbd: (219 commits)
ksmbd: fix __write_overflow warning in ndr_read_string
MAINTAINERS: ksmbd: add cifs_common directory to ksmbd entry
MAINTAINERS: ksmbd: update my email address
ksmbd: fix permission check issue on chown and chmod
ksmbd: don't set FILE DELETE and FILE_DELETE_CHILD in access mask by default
MAINTAINERS: add git adddress of ksmbd
ksmbd: update SMB3 multi-channel support in ksmbd.rst
ksmbd: smbd: fix kernel oops during server shutdown
ksmbd: remove select FS_POSIX_ACL in Kconfig
ksmbd: use proper errno instead of -1 in smb2_get_ksmbd_tcon()
ksmbd: update the comment for smb2_get_ksmbd_tcon()
ksmbd: change int data type to boolean
ksmbd: Fix multi-protocol negotiation
ksmbd: fix an oops in error handling in smb2_open()
ksmbd: add ipv6_addr_v4mapped check to know if connection from client is ipv4
ksmbd: fix missing error code in smb2_lock
ksmbd: use channel signingkey for binding SMB2 session setup
ksmbd: don't set RSS capable in FSCTL_QUERY_NETWORK_INTERFACE_INFO
ksmbd: Return STATUS_OBJECT_PATH_NOT_FOUND if smb2_creat() returns ENOENT
ksmbd: fix -Wstringop-truncation warnings
...


d3624466 31-Aug-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Restyle comments to better align with kernel-doc

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

78ab59fe 31-Aug-2021 Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fs/ntfs3: Rework file operations

Rename now works "Add new name and remove old name".
"Remove old name and add new name" may result in bad inode
if we can't add new name and then can't restore (add) old name.

Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

a97131c2 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now

For some reason we have FAT ioctl calls. Even old ntfs driver did not
use these. We should not use these because it his hard to get things out
of kernel when they are upstream. That's why we remove these for now.

More discussion is needed what ioctl should be implemented and what is
important.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

59bda8ec 31-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

fuse: flush extending writes

Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.

If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.

There are at least two cases where this can cause problems:

- copy_file_range() will return short count if the file is not extended
up to end of the source range.

- FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
hence the region may not be fully allocated.

Fix by flushing writes from the start of the range up to the end of the
file. This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.

Fixes: a2bc92362941 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb56b17c ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org> # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

baaae979 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: make the updating inode data procedure atomic

Now that ext4_do_update_inode() return error before filling the whole
inode data if we fail to set inode blocks in ext4_inode_blocks_set().
This error should never happen in theory since sb->s_maxbytes should not
have allowed this, we have already init sb->s_maxbytes according to this
feature in ext4_fill_super(). So even through that could only happen due
to the filesystem corruption, we'd better to return after we finish
updating the inode because it may left an uninitialized buffer and we
could read this buffer later in "errors=continue" mode.

This patch make the updating inode data procedure atomic, call
EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad
happened, make sure that the inode is integrated, and also drop a BUG_ON
and do some small cleanups.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

8e33fadf 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: remove an unnecessary if statement in __ext4_get_inode_loc()

The "if (!buffer_uptodate(bh))" hunk covered almost the whole code after
getting buffer in __ext4_get_inode_loc() which seems unnecessary, remove
it and switch to check ext4_buffer_uptodate(), it simplify code and make
it more readable.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

0904c9ae 26-Aug-2021 Zhang Yi <yi.zhang@huawei.com>

ext4: move inode eio simulation behind io completeion

No EIO simulation is required if the buffer is uptodate, so move the
simulation behind read bio completeion just like inode/block bitmap
simulation does.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210826130412.3921207-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

4a79a98c 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Improve scalability of ext4 orphan file handling

Even though the length of the critical section when adding / removing
orphaned inodes was significantly reduced by using orphan file, the
contention of lock protecting orphan file still appears high in profiles
for truncate / unlink intensive workloads with high number of threads.

This patch makes handling of orphan file completely lockless. Also to
reduce conflicts between CPUs different CPUs start searching for empty
slot in orphan file in different blocks.

Performance comparison of locked orphan file handling, lockless orphan
file handling, and completely disabled orphan inode handling
from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on
SAS SSD disk, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time Time Time
Orphan locked Orphan lockless No orphan
1 0.945600 0.939400 0.891200
2 1.331800 1.246600 1.174400
4 1.995000 1.780600 1.713200
8 6.424200 4.900000 4.106000
16 14.937600 8.516400 8.138000
32 33.038200 24.565600 24.002200
64 60.823600 39.844600 38.440200
128 122.941400 70.950400 69.315000

So we can see that with lockless orphan file handling, addition /
deletion of orphaned inodes got almost completely out of picture even
for a microbenchmark stressing it.

For reaim creat_clo workload on ramdisk there are also noticeable gains
(average of 5 runs):

Clients Vanilla (ops/s) Patched (ops/s)
creat_clo-1 14705.88 ( 0.00%) 14354.07 * -2.39%*
creat_clo-3 27108.43 ( 0.00%) 28301.89 ( 4.40%)
creat_clo-5 37406.48 ( 0.00%) 45180.73 * 20.78%*
creat_clo-7 41338.58 ( 0.00%) 54687.50 * 32.29%*
creat_clo-9 45226.13 ( 0.00%) 62937.07 * 39.16%*
creat_clo-11 44000.00 ( 0.00%) 65088.76 * 47.93%*
creat_clo-13 36516.85 ( 0.00%) 68661.97 * 88.03%*
creat_clo-15 30864.20 ( 0.00%) 69551.78 * 125.35%*
creat_clo-17 27478.45 ( 0.00%) 67729.08 * 146.48%*
creat_clo-19 25000.00 ( 0.00%) 61621.62 * 146.49%*
creat_clo-21 18772.35 ( 0.00%) 63829.79 * 240.02%*
creat_clo-23 16698.94 ( 0.00%) 61938.96 * 270.92%*
creat_clo-25 14973.05 ( 0.00%) 56947.61 * 280.33%*
creat_clo-27 16436.69 ( 0.00%) 65008.03 * 295.51%*
creat_clo-29 13949.01 ( 0.00%) 69047.62 * 395.00%*
creat_clo-31 14283.52 ( 0.00%) 67982.45 * 375.95%*

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

02f310fc 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Speedup ext4 orphan inode handling

Ext4 orphan inode handling is a bottleneck for workloads which heavily
truncate / unlink small files since it contends on the global
s_orphan_mutex lock (and generally it's difficult to improve scalability
of the ondisk linked list of orphaned inodes).

This patch implements new way of handling orphan inodes. Instead of
linking orphaned inode into a linked list, we store it's inode number in
a new special file which we call "orphan file". Only if there's no more
space in the orphan file (too many inodes are currently orphaned) we
fall back to using old style linked list. Currently we protect
operations in the orphan file with a spinlock for simplicity but even in
this setting we can substantially reduce the length of the critical
section and thus speedup some workloads. In the next patch we improve
this by making orphan handling lockless.

Note that the change is backwards compatible when the filesystem is
clean - the existence of the orphan file is a compat feature, we set
another ro-compat feature indicating orphan file needs scanning for
orphaned inodes when mounting filesystem read-write. This ro-compat
feature gets cleared on unmount / remount read-only.

Some performance data from 80 CPU Xeon Server with 512 GB of RAM,
filesystem located on SSD, average of 5 runs:

stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)

Threads Time Time
Vanilla Patched
1 1.057200 0.945600
2 1.680400 1.331800
4 2.547000 1.995000
8 7.049400 6.424200
16 14.827800 14.937600
32 40.948200 33.038200
64 87.787400 60.823600
128 206.504000 122.941400

So we can see significant wins all over the board.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

25c6d98f 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Move orphan inode handling into a separate file

Move functions for handling orphan inodes into a new file
fs/ext4/orphan.c to have them in one place and somewhat reduce size of
other files. No code changes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

b33d9f59 14-Aug-2021 Theodore Ts'o <tytso@mit.edu>

jbd2: add sparse annotations for add_transaction_credits()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

188c299e 16-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Support for checksumming from journal triggers

JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.

So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

a5fda113 14-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix sparse warnings

Add sparse annotations to suppress false positive context imbalance
warnings, and use NULL instead of 0 in EXT_MAX_{EXTENT,INDEX}.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>

a54c4613 20-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: fix race writing to an inline_data file while its xattrs are changing

The location of the system.data extended attribute can change whenever
xattr_sem is not taken. So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().

This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.

Cc: stable@kernel.org
Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data")
Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

bd2c38cf 12-Aug-2021 Jan Kara <jack@suse.cz>

ext4: Make sure quota files are not grabbed accidentally

If ext4 filesystem is corrupted so that quota files are linked from
directory hirerarchy, bad things can happen. E.g. quota files can get
corrupted or deleted. Make sure we are not grabbing quota file inodes
when we expect normal inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210812133122.26360-1-jack@suse.cz

b2bbb92f 12-Aug-2021 Jan Kara <jack@suse.cz>

ext4: fix e2fsprogs checksum failure for mounted filesystem

Commit 81414b4dd48 ("ext4: remove redundant sb checksum
recomputation") removed checksum recalculation after updating
superblock free space / inode counters in ext4_fill_super() based on
the fact that we will recalculate the checksum on superblock
writeout.

That is correct assumption but until the writeout happens (which can
take a long time) the checksum is incorrect in the buffer cache and if
programs such as tune2fs or resize2fs is called shortly after a file
system is mounted can fail. So return back the checksum recalculation
and add a comment explaining why.

Fixes: 81414b4dd48f ("ext4: remove redundant sb checksum recomputation")
Cc: stable@kernel.org
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210812124737.21981-1-jack@suse.cz

308c57cc 13-Aug-2021 Theodore Ts'o <tytso@mit.edu>

ext4: if zeroout fails fall back to splitting the extent node

If the underlying storage device is using thin-provisioning, it's
possible for a zeroout operation to return ENOSPC.

Commit df22291ff0fd ("ext4: Retry block allocation if we have free blocks
left") added logic to retry block allocation since we might get free block
after we commit a transaction. But the ENOSPC from thin-provisioning
will confuse ext4, and lead to an infinite loop.

Since using zeroout instead of splitting the extent node is an
optimization, if it fails, we might as well fall back to splitting the
extent node.

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

facec450 27-Jul-2021 Guoqing Jiang <jiangguoqing@kylinos.cn>

ext4: reduce arguments of ext4_fc_add_dentry_tlv

Let's pass fc_dentry directly since those arguments (tag, parent_ino and
ino etc) can be deferenced from it.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210727080708.3708814-1-guoqing.jiang@linux.dev

5036ab8d 30-Aug-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: flush background discard kwork when retry allocation

The background discard kwork tries to mark blocks used and issue
discard. This can make filesystem suffer from NOSPC error, xfstest
generic/371 can fail due to it. Fix it by flushing discard kwork
in ext4_should_retry_alloc. At the same time, give up discard at
the moment.

Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

55cdd0af 24-Jul-2021 Wang Jianchao <wangjianchao@kuaishou.com>

ext4: get discard out of jbd2 commit kthread contex

Right now, discard is issued and waited to be completed in jbd2
commit kthread context after the logs are committed. When large
amount of files are deleted and discard is flooding, jbd2 commit
kthread can be blocked for long time. Then all of the metadata
operations can be blocked to wait the log space.

One case is the page fault path with read mm->mmap_sem held, which
wants to update the file time but has to wait for the log space.
When other threads in the task wants to do mmap, then write mmap_sem
is blocked. Finally all of the following read mmap_sem requirements
are blocked, even the ps command which need to read the /proc/pid/
-cmdline. Our monitor service which needs to read /proc/pid/cmdline
used to be blocked for 5 mins.

This patch frees the blocks back to buddy after commit and then do
discard in a async kworker context in fstrim fashion, namely,
- mark blocks to be discarded as used if they have not been allocated
- do discard
- mark them free
After this, jbd2 commit kthread won't be blocked any more by discard
and we won't get NOSPC even if the discard is slow or throttled.

Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com>
Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

b91db6a0 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"

* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()


3b629f8d 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block

Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.

It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio


c547d89a 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

- cancellation cleanups (Hao, Pavel)

- io-wq accounting cleanup (Hao)

- io_uring submit locking fix (Hao)

- io_uring link handling fixes (Hao)

- fixed file improvements (wangyangbo, Pavel)

- allow updates of linked timeouts like regular timeouts (Pavel)

- IOPOLL fix (Pavel)

- remove batched file get optimization (Pavel)

- improve reference handling (Pavel)

- IRQ task_work batching (Pavel)

- allow pure fixed file, and add support for open/accept (Pavel)

- GFP_ATOMIC RT kernel fix

- multiple CQ ring waiter improvement

- funnel IRQ completions through task_work

- add support for limiting async workers explicitly

- add different clocksource support for timeouts

- io-wq wakeup race fix

- lots of cleanups and improvement (Pavel et al)

* tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits)
io-wq: fix wakeup race when adding new work
io-wq: wqe and worker locks no longer need to be IRQ safe
io-wq: check max_worker limits if a worker transitions bound state
io_uring: allow updating linked timeouts
io_uring: keep ltimeouts in a list
io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
io-wq: provide a way to limit max number of workers
io_uring: add build check for buf_index overflows
io_uring: clarify io_req_task_cancel() locking
io_uring: add task-refs-get helper
io_uring: fix failed linkchain code logic
io_uring: remove redundant req_set_fail()
io_uring: don't free request to slab
io_uring: accept directly into fixed file table
io_uring: hand code io_accept() fd installing
io_uring: openat directly into fixed fd table
net: add accept helper not installing fd
io_uring: fix io_try_cancel_userdata race for iowq
io_uring: IRQ rw completion batching
io_uring: batch task work locking
...


67936911 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Nothing major in here - lots of good cleanups and tech debt handling,
which is also evident in the diffstats. In particular:

- Add disk sequence numbers (Matteo)

- Discard merge fix (Ming)

- Relax disk zoned reporting restrictions (Niklas)

- Bio error handling zoned leak fix (Pavel)

- Start of proper add_disk() error handling (Luis, Christoph)

- blk crypto fix (Eric)

- Non-standard GPT location support (Dmitry)

- IO priority improvements and cleanups (Damien)o

- blk-throtl improvements (Chunguang)

- diskstats_show() stack reduction (Abd-Alrhman)

- Loop scheduler selection (Bart)

- Switch block layer to use kmap_local_page() (Christoph)

- Remove obsolete disk_name helper (Christoph)

- block_device refcounting improvements (Christoph)

- Ensure gendisk always has a request queue reference (Christoph)

- Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
sg: pass the device name to blk_trace_setup
block, bfq: cleanup the repeated declaration
blk-crypto: fix check for too-large dun_bytes
blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
block: mark blkdev_fsync static
block: refine the disk_live check in del_gendisk
mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
mmc: block: Support alternative_gpt_sector() operation
partitions/efi: Support non-standard GPT location
block: Add alternative_gpt_sector() operation
bio: fix page leak bio_add_hw_page failure
block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
block: remove a pointless call to MINOR() in device_add_disk
null_blk: add error handling support for add_disk()
virtio_blk: add error handling support for add_disk()
block: add error handling for device_add_disk / add_disk
block: return errors from disk_alloc_events
block: return errors from blk_integrity_add
block: call blk_register_queue earlier in device_add_disk
...


8596e589 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and related drivers:

Core code:

- Cure a couple of correctness issues in the posix CPU timer code to
prevent that the tick dependency for NOHZ full is kept alive for no
reason.

- Avoid expensive double reprogramming of the clockevent device in
hrtimer_start_range_ns().

- Avoid pointless SMP function calls when the clock was set to avoid
disturbing CPUs which do not have any affected timers queued.

- Make the clocksource watchdog test work correctly when CONFIG_HZ is
less than 100.

Drivers:

- Prefer the ARM architected timer over the Exynos timer which is way
more expensive to access.

- Add device tree bindings for new Ingenic SoCs

- The usual improvements and cleanups all over the place"

* tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
clocksource: Make clocksource watchdog test safe for slow-HZ systems
dt-bindings: timer: Add ABIs for new Ingenic SoCs
clocksource/drivers/fttmr010: Pass around less pointers
clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown
clocksource/drivers/ingenic: Use bitfield macro helpers
clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
dt-bindings: timer: convert rockchip,rk-timer.txt to YAML
clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU
clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64
hrtimer: Unbreak hrtimer_force_reprogram()
hrtimer: Use raw_cpu_ptr() in clock_was_set()
hrtimer: Avoid more SMP function calls in clock_was_set()
hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
hrtimer: Add bases argument to clock_was_set()
time/timekeeping: Avoid invoking clock_was_set() twice
timekeeping: Distangle resume and clock-was-set events
timerfd: Provide timerfd_resume()
hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
hrtimer: Ensure timerfd notification for HIGHRES=n
hrtimer: Consolidate reprogramming code
...


5d3c0db4 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

- The biggest change in this cycle is scheduler support for asymmetric
scheduling affinity, to support the execution of legacy 32-bit tasks
on AArch32 systems that also have 64-bit-only CPUs.

Architectures can fill in this functionality by defining their own
task_cpu_possible_mask(p). When this is done, the scheduler will make
sure the task will only be scheduled on CPUs that support it.

(The actual arm64 specific changes are not part of this tree.)

For other architectures there will be no change in functionality.

- Add cgroup SCHED_IDLE support

- Increase node-distance flexibility & delay determining it until a CPU
is brought online. (This enables platforms where node distance isn't
final until the CPU is only.)

- Deadline scheduler enhancements & fixes

- Misc fixes & cleanups.

* tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
eventfd: Make signal recursion protection a task bit
sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
sched: Introduce dl_task_check_affinity() to check proposed affinity
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Split the guts of sched_setaffinity() into a helper function
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
sched: Cgroup SCHED_IDLE support
sched/topology: Skip updating masks for non-online nodes
sched: Replace deprecated CPU-hotplug functions.
sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
sched: Fix UCLAMP_FLAG_IDLE setting
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
sched/fair: Avoid a second scan of target in select_idle_cpu
sched/fair: Use prev instead of new target as recent_used_cpu
sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
...


6f01c935 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
"This starts with a couple of fixes for potential deadlocks in the
fowner/fasync handling.

The next patch removes the old mandatory locking code from the kernel
altogether.

The last patch cleans up rw_verify_area a bit more after the mandatory
locking removal"

* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: clean up after mandatory file locking support removal
fs: remove mandatory file locking support
fcntl: fix potential deadlock for &fasync_struct.fa_lock
fcntl: fix potential deadlocks for &fown_struct.lock


aa99f3c2 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.

This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex


8cfb9015 27-Aug-2021 Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Always provide aligned buffers to the RPC read layers

Instead of messing around with XDR padding in the RDMA layer, we should
just give the RPC layer an aligned buffer. Try to avoid creating extra
RPC calls by aligning to the smaller value of ALIGN(len, rsize) and
PAGE_SIZE.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

a1ca8e71 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF and isofs updates from Jan Kara:
"Several smaller fixes and cleanups in UDF and isofs"

* tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf_get_extendedattr() had no boundary checks.
isofs: joliet: Fix iocharset=utf8 mount option
udf: Fix iocharset=utf8 mount option
udf: Get rid of 0-length arrays in struct fileIdentDesc
udf: Get rid of 0-length arrays
udf: Remove unused declaration
udf: Check LVID earlier


63b0c403 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull FIEMAP cleanups from Jan Kara:
"FIEMAP cleanups from Christoph transitioning all remaining filesystems
supporting FIEMAP (ext2, hpfs) to iomap API and removing the old
helper"

* tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: remove generic_block_fiemap
hpfs: use iomap_fiemap to implement ->fiemap
ext2: use iomap_fiemap to implement ->fiemap
ext2: make ext2_iomap_ops available unconditionally


f7db8dd6 29-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: enable realtime discard iff device supports discard

Let's only enable realtime discard if and only if device supports
discard functionality.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

dddd3d65 19-Aug-2021 Jaegeuk Kim <jaegeuk@kernel.org>

f2fs: guarantee to write dirty data when enabling checkpoint back

We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

c8dc3047 25-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix to unmap pages from userspace process in punch_hole()

We need to unmap pages from userspace process before removing pagecache
in punch_hole() like we did in f2fs_setattr().

Similar change:
commit 5e44f8c374dc ("ext4: hole-punch use truncate_pagecache_range")

Fixes: fbfa2cc58d53 ("f2fs: add file operations")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

adf9ea89 25-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()

In below path, it will return ENOENT if filesystem is shutdown:

- f2fs_map_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_page
- __get_node_page
- read_node_page
- is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
return -ENOENT
- force return value from ENOENT to 0

It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.

However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.

There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.

xfstest generic/388 reports below warning:

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
iomap_apply+0x1c4/0x7b0
? dax_iomap_rw+0x1c0/0x1c0
dax_iomap_rw+0xad/0x1c0
? dax_iomap_rw+0x1c0/0x1c0
f2fs_file_write_iter+0x5ab/0x970 [f2fs]
do_iter_readv_writev+0x273/0x2e0
do_iter_write+0xab/0x1f0
vfs_iter_write+0x21/0x40
iter_file_splice_write+0x287/0x540
do_splice+0x37c/0xa60
__x64_sys_splice+0x15f/0x3a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
dax_iomap_fault+0x44/0x70
f2fs_dax_huge_fault+0x155/0x400 [f2fs]
f2fs_dax_fault+0x18/0x30 [f2fs]
__do_fault+0x4e/0x120
do_fault+0x3cf/0x7a0
__handle_mm_fault+0xa8c/0xf20
? find_held_lock+0x39/0xd0
handle_mm_fault+0x1b6/0x480
do_user_addr_fault+0x320/0xcd0
? rcu_read_lock_sched_held+0x67/0xc0
exc_page_fault+0x77/0x3f0
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30

Fixes: 83a3bfdb5a8a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

ad126ebd 23-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: fix to account missing .skipped_gc_rwsem

There is a missing place we forgot to account .skipped_gc_rwsem, fix it.

Fixes: 6f8d4455060d ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

d75da8c8 23-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: adjust unlock order for cleanup

This patch adjusts unlock order of .i_mmap_sem and .i_gc_rwsem for
cleanup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

4d674904 19-Aug-2021 Fengnan Chang <changfengnan@vivo.com>

f2fs: Don't create discard thread when device doesn't support realtime discard

Don't create discard thread when device doesn't support realtime discard
or user specifies nodiscard mount option.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

35134319 30-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
"fsnotify speedups when notification actually isn't used and support
for identifying processes which caused fanotify events through pidfd
instead of normal pid"

* tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: optimize the case of no marks of any type
fsnotify: count all objects with attached connectors
fsnotify: count s_fsnotify_inode_refs for attached connectors
fsnotify: replace igrab() with ihold() on attach connector
fanotify: add pidfd support to the fanotify API
fanotify: introduce a generic info record copying helper
fanotify: minor cosmetic adjustments to fid labels
kernel/pid.c: implement additional checks upon pidfd_create() parameters
kernel/pid.c: remove static qualifier from pidfd_create()


e8b8e97f 03-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Restyle comments to better align with kernel-doc

Capitalize comments and end with period for better reading.

Also function comments are now little more kernel-doc style. This way we
can easily convert them to kernel-doc style if we want. Note that these
are not yet complete with this style. Example function comments start
with /* and in kernel-doc style they start /**.

Use imperative mood in function descriptions.

Change words like ntfs -> NTFS, linux -> Linux.

Use "we" not "I" when commenting code.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

87df7fb9 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: fix wakeup race when adding new work

When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:

CPU0 CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
lock
atomic_dec(nr_running)
no wakeup needed

Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a9a4aa9f 30-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: wqe and worker locks no longer need to be IRQ safe

io_uring no longer queues async work off completion handlers that run in
hard or soft interrupt context, and that use case was the only reason that
io-wq had to use IRQ safe locks for wqe and worker locks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecc53c48 29-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: check max_worker limits if a worker transitions bound state

For the two places where new workers are created, we diligently check if
we are allowed to create a new worker. If we're currently at the limit
of how many workers of a given type we can have, then we don't create
any new ones.

If you have a mixed workload with various types of bound and unbounded
work, then it can happen that a worker finishes one type of work and
is then transitioned to the other type. For this case, we don't check
if we are actually allowed to do so. This can cause io-wq to temporarily
exceed the allowed number of workers for a given type.

When retrieving work, check that the types match. If they don't, check
if we are allowed to transition to the other type. If not, then don't
handle the new work.

Cc: stable@vger.kernel.org
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1042b6c 28-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: allow updating linked timeouts

We allow updating normal timeouts, add support for adjusting timings of
linked timeouts as well.

Reported-by: Victor Stewart <v@nametag.social>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef9dd637 28-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: keep ltimeouts in a list

A preparation patch. Keep all queued linked timeout in a list, so they
may be found and updated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50c1df2b 27-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts

Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.

Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.

Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.

Link: https://github.com/axboe/liburing/issues/369
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e480058 27-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: provide a way to limit max number of workers

io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file
or a block device. This type of work is limited based on the size of
the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount
of workers here is just limited by RLIMIT_NPROC.

For various uses cases, it's handy to have the kernel limit the maximum
amount of pending workers for both categories. Provide a way to do with
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.

The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.

Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: https://github.com/axboe/liburing/issues/420
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b542e383 29-Jul-2021 Thomas Gleixner <tglx@linutronix.de>

eventfd: Make signal recursion protection a task bit

The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.

Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.

Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.

No functional change for !RT kernels.

Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx

2a7a451a 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

NFSv4.1 add network transport when session trunking is detected

After trunking is discovered in nfs4_discover_server_trunking(),
add the transport to the old client structure if the allowed limit
of transports has not been reached.

An example: there exists a multi-homed server and client mounts
one server address and some volume and then doest another mount to
a different address of the same server and perhaps a different
volume. Previously, the client checks that this is a session
trunkable servers (same server), and removes the newly created
client structure along with its transport. Now, the client
adds the connection from the 2nd mount into the xprt switch of
the existing client (it leads to having 2 available connections).

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

dc48e0ab 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

SUNRPC enforce creation of no more than max_connect xprts

If we are adding new transports via rpc_clnt_test_and_add_xprt()
then check if we've reached the limit. Currently only pnfs path
adds transports via that function but this is done in
preparation when the client would add new transports when
session trunking is detected. A warning is logged if the
limit is reached.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7e134205 27-Aug-2021 Olga Kornievskaia <kolga@netapp.com>

NFSv4 introduce max_connect mount options

This option will control up to how many xprts can the client
establish to the server with a distinct address (that means
nconnect connections are not counted towards this new limit).
This patch is setting up nfs structures to keeep track of the
max_connect limit (does not enforce it).

The default value is kept at 1 so that no current mounts that
don't want any additional connections would be effected. The
maximum value is set at 16.

Mounts to DS are not limited to default value of 1 but instead
set to the maximum default value of 16 (NFS_MAX_TRANSPORTS).

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

79d534f8 17-Aug-2021 Ye Bin <yebin10@huawei.com>

NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox

As eb96d5c97b08 ("SUNRPC handle EKEYEXPIRED in call_refreshresult")
commit handle EKEYEXPIRED in call_refreshresult, so there is only handle
when "task->tk_status" is equal "-EJUKEBOX" in nfs3_async_handle_jukebox.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7d5d8d71 26-Aug-2021 Namjae Jeon <namjae.jeon@samsung.com>

ksmbd: fix __write_overflow warning in ndr_read_string

Dan reported __write_overflow warning in ndr_read_string.

CC [M] fs/ksmbd/ndr.o
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:11,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/cpumask.h:5,
from ./arch/x86/include/asm/msr.h:11,
from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:60,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from ./include/linux/wait.h:9,
from ./include/linux/wait_bit.h:8,
from ./include/linux/fs.h:6,
from fs/ksmbd/ndr.c:7:
In function memcpy,
inlined from ndr_read_string at fs/ksmbd/ndr.c:86:2,
inlined from ndr_decode_dos_attr at fs/ksmbd/ndr.c:167:2:
./include/linux/fortify-string.h:219:4: error: call to __write_overflow
declared with attribute error: detected write beyond size of object
__write_overflow();
^~~~~~~~~~~~~~~~~~

This seems to be a false alarm because hex_attr size is always smaller
than n->length. This patch fix this warning by allocation hex_attr with
n->length.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

90499ad0 25-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add build check for buf_index overflows

req->buf_index is u16 and so we rely on registered buffers indexes
fitting into it. Add a build check, so when the upper limit for the
number of buffers is lifted we get a compliation fail but not lurking
problems.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b18a1a45 25-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clarify io_req_task_cancel() locking

It's too easy to forget and misjudge about synchronisation in
io_req_task_cancel(), add a comment clarifying it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71099083835f983a1fd73d5a3da6391924da8300.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8155e95 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Fix error handling in indx_insert_into_root()

There are three bugs in this code:
1) If indx_get_root() fails, then return -EINVAL instead of success.
2) On the "/* make root external */" -EOPNOTSUPP; error path it should
free "re" but it has a memory leak.
3) If indx_new() fails then it will lead to an error pointer dereference
when we call put_indx_node().

I've re-written the error handling to be more clear.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

8c83a485 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Potential NULL dereference in hdr_find_split()

The "e" pointer is dereferenced before it has been checked for NULL.
Move the dereference after the NULL check to prevent an Oops.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

04810f00 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: Fix error code in indx_add_allocate()

Return -EINVAL if ni_find_attr() fails. Don't return success.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

2926e429 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: fix an error code in ntfs_get_acl_ex()

The ntfs_get_ea() function returns negative error codes or on success
it returns the length. In the original code a zero length return was
treated as -ENODATA and results in a NULL return. But it should be
treated as an invalid length and result in an PTR_ERR(-EINVAL) return.

Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

a1b04d38 24-Aug-2021 Dan Carpenter <dan.carpenter@oracle.com>

fs/ntfs3: add checks for allocation failure

Add a check for when the kzalloc() in init_rsttbl() fails. Some of
the callers checked for NULL and some did not. I went down the call
tree and added NULL checks where ever they were missing.

Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

345482bc 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc

Use kcalloc/kmalloc_array over kzalloc/kmalloc when we allocate array.
Checkpatch found these after we did not use our own defined allocation
wrappers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

195c52bd 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Do not use driver own alloc wrappers

Problem with these wrapper is that we cannot take off example GFP_NOFS
flag. It is not recomended use those in all places. Also if we change
one driver specific wrapper to kernel wrapper then it would look really
weird. People should be most familiar with kernel wrappers so let's just
use those ones.

Driver specific alloc wrapper also confuse some static analyzing tools,
good example is example kernels checkpatch tool. After we converter
these to kernel specific then warnings is showed.

Following Coccinelle script was used to automate changing.

virtual patch

@alloc depends on patch@
expression x;
expression y;
@@
(
- ntfs_malloc(x)
+ kmalloc(x, GFP_NOFS)
|
- ntfs_zalloc(x)
+ kzalloc(x, GFP_NOFS)
|
- ntfs_vmalloc(x)
+ kvmalloc(x, GFP_NOFS)
|
- ntfs_free(x)
+ kfree(x)
|
- ntfs_vfree(x)
+ kvfree(x)
|
- ntfs_memdup(x, y)
+ kmemdup(x, y, GFP_NOFS)
)

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

fa3cacf5 26-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use kernel ALIGN macros over driver specific

The static checkers (Smatch) were complaining because QuadAlign() was
buggy. If you try to align something higher than UINT_MAX it got
truncated to a u32.

Smatch warning was:
fs/ntfs3/attrib.c:383 attr_set_size_res()
warn: was expecting a 64 bit value instead of '~7'

So that this will not happen again we will change all these macros to
kernel made ones. This can also help some other static analyzing tools
to give us better warnings.

Patch was generated with Coccinelle script and after that some style
issue was hand fixed.

Coccinelle script:

virtual patch

@alloc depends on patch@
expression x;
@@
(
- #define QuadAlign(n) (((n) + 7u) & (~7u))
|
- QuadAlign(x)
+ ALIGN(x, 8)
|
- #define IsQuadAligned(n) (!((size_t)(n)&7u))
|
- IsQuadAligned(x)
+ IS_ALIGNED(x, 8)
|
- #define Quad2Align(n) (((n) + 15u) & (~15u))
|
- Quad2Align(x)
+ ALIGN(x, 16)
|
- #define IsQuad2Aligned(n) (!((size_t)(n)&15u))
|
- IsQuad2Aligned(x)
+ IS_ALIGNED(x, 16)
|
- #define Quad4Align(n) (((n) + 31u) & (~31u))
|
- Quad4Align(x)
+ ALIGN(x, 32)
|
- #define IsSizeTAligned(n) (!((size_t)(n) & (sizeof(size_t) - 1)))
|
- IsSizeTAligned(x)
+ IS_ALIGNED(x, sizeof(size_t))
|
- #define DwordAlign(n) (((n) + 3u) & (~3u))
|
- DwordAlign(x)
+ ALIGN(x, 4)
|
- #define IsDwordAligned(n) (!((size_t)(n)&3u))
|
- IsDwordAligned(x)
+ IS_ALIGNED(x, 4)
|
- #define WordAlign(n) (((n) + 1u) & (~1u))
|
- WordAlign(x)
+ ALIGN(x, 2)
|
- #define IsWordAligned(n) (!((size_t)(n)&1u))
|
- IsWordAligned(x)
+ IS_ALIGNED(x, 2)
|
)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

24516d48 24-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Restyle comment block in ni_parse_reparse()

First of this fix one none utf8 char in this comment block. Maybe
this happened because error in filesystem ;)

Also this block was hard to read because long lines so make it max 80
long. And while we doing this stuff make little better grammer.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

1263eddf 19-Aug-2021 Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

fs/ntfs3: Remove unused including <linux/version.h>

Eliminate the follow versioncheck warning:

./fs/ntfs3/inode.c: 16 linux/version.h not needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

abfeb2ee 18-Aug-2021 Gustavo A. R. Silva <gustavoars@kernel.org>

fs/ntfs3: Fix fall-through warnings for Clang

Fix the following fallthrough warnings:

fs/ntfs3/inode.c:1792:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/ntfs3/index.c:178:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]

This helps with the ongoing efforts to globally enable
-Wimplicit-fallthrough for Clang.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

be87e821 17-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Fix one none utf8 char in source file

In one source file there is for some reason non utf8 char. But hey this
is fs development so this kind of thing might happen.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

8c01308b 16-Aug-2021 Nathan Chancellor <nathan@kernel.org>

fs/ntfs3: Remove unused variable cnt in ntfs_security_init()

Clang warns:

fs/ntfs3/fsntfs.c:1874:9: warning: variable 'cnt' set but not used
[-Wunused-but-set-variable]
size_t cnt, off;
^
1 warning generated.

It is indeed unused so remove it.

Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

71eeb6ac 16-Aug-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Fix integer overflow in multiplication

The multiplication of the u32 data_size with a int is being performed
using 32 bit arithmetic however the results is being assigned to the
variable nbits that is a size_t (64 bit) value. Fix a potential
integer overflow by casting the u32 value to a size_t before the
multiply to use a size_t sized bit multiply operation.

Addresses-Coverity: ("Unintentional integer overflow")
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

87790b65 16-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Add ifndef + define to all header files

Add guards so that compiler will only include header files once.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

528c9b3d 16-Aug-2021 Kari Argillander <kari.argillander@gmail.com>

fs/ntfs3: Use linux/log2 is_power_of_2 function

We do not need our own implementation for this function in this
driver. It is much better to use generic one.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

f8d87ed9 16-Aug-2021 Colin Ian King <colin.king@canonical.com>

fs/ntfs3: Fix various spelling mistakes

There is a spelling mistake in a ntfs_err error message. Also
fix various spelling mistakes in comments.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

9a10867a 27-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add task-refs-get helper

As we have a more complicated task referencing, which apart from normal
task references includes taking tctx->inflight and caching all that, it
would be a good idea to have all that isolated in helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9114d037f1c195897aa13f38a496078eca2afdb.1630023531.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8295b98 27-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix failed linkchain code logic

Given a linkchain like this:
req0(link_flag)-->req1(link_flag)-->...-->reqn(no link_flag)

There is a problem:
- if some intermediate linked req like req1 's submittion fails, reqs
after it won't be cancelled.

- sqpoll disabled: maybe it's ok since users can get the error info
of req1 and stop submitting the following sqes.

- sqpoll enabled: definitely a problem, the following sqes will be
submitted in the next round.

The solution is to refactor the code logic to:
- if a linked req's submittion fails, just mark it and the head(if it
exists) as REQ_F_FAIL. Leverage req->result to indicate whether it
is failed or cancelled.
- submit or fail the whole chain when we come to the end of it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

14afdd6e 27-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: remove redundant req_set_fail()

req_set_fail() in io_submit_sqe() is redundant, remove it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

20ec197b 29-Mar-2021 David Howells <dhowells@redhat.com>

fscache: Use refcount_t for the cookie refcount instead of atomic_t

Use refcount_t for the fscache_cookie refcount instead of atomic_t and
rename the 'usage' member to 'ref' in such cases. The tracepoints that
reference it change from showing "u=%d" to "r=%d".

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431204358.2908479.8006938388213098079.stgit@warthog.procyon.org.uk/

33cba859 18-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Fix fscache_cookie_put() to not deref after dec

fscache_cookie_put() accesses the cookie it has just put inside the
tracepoint that monitors the change - but this is something it's not
allowed to do if we didn't reduce the count to zero.

Fix this by dropping most of those values from the tracepoint and grabbing
the cookie debug ID before doing the dec.

Also take the opportunity to switch over the usage and where arguments on
the tracepoint to put the reason last.

Fixes: a18feb55769b ("fscache: Add tracepoints")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431203107.2908479.3259582550347000088.stgit@warthog.procyon.org.uk/

35b72573 17-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Fix cookie key hashing

The current hash algorithm used for hashing cookie keys is really bad,
producing almost no dispersion (after a test kernel build, ~30000 files
were split over just 18 out of the 32768 hash buckets).

Borrow the full_name_hash() hash function into fscache to do the hashing
for cookie keys and, in the future, volume keys.

I don't want to use full_name_hash() as-is because I want the hash value to
be consistent across arches and over time as the hash value produced may
get used on disk.

I can also optimise parts of it away as the key will always be a padded
array of aligned 32-bit words.

Fixes: ec0328e46d6e ("fscache: Maintain a catalogue of allocated cookies")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431201844.2908479.8293647220901514696.stgit@warthog.procyon.org.uk/

8beabdde 19-Oct-2020 David Howells <dhowells@redhat.com>

cachefiles: Change %p in format strings to something else

Change plain %p in format strings in cachefiles code to something more
useful, since %p is now hashed before printing and thus no longer matches
the contents of an oops register dump.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/160588476042.3465195.6837847445880367183.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431200692.2908479.9253374494073633778.stgit@warthog.procyon.org.uk/

c97a72de 19-Oct-2020 David Howells <dhowells@redhat.com>

fscache: Change %p in format strings to something else

Change plain %p in format strings in fscache code to something more useful,
since %p is now hashed before printing and thus no longer matches the
contents of an oops register dump.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/160588474843.3465195.5446072310069374803.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431199509.2908479.2950631488219944294.stgit@warthog.procyon.org.uk/

58f386a7 12-May-2021 David Howells <dhowells@redhat.com>

fscache: Remove the object list procfile

Remove the object list procfile from fscache as objects will become
entirely internal to the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431198332.2908479.5847286163455099669.stgit@warthog.procyon.org.uk/

6ae9bd8b 12-May-2021 David Howells <dhowells@redhat.com>

fscache, cachefiles: Remove the histogram stuff

Remove the histogram stuff as it's mostly going to be outdated.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431195953.2908479.16770977195634296638.stgit@warthog.procyon.org.uk/

884a7688 10-Feb-2020 David Howells <dhowells@redhat.com>

fscache: Procfile to display cookies

Add /proc/fs/fscache/cookies to display active cookies.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/158861211871.340223.7223853943667440807.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465771021.1376105.6933857529128238020.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588460994.3465195.16963417803501149328.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431194785.2908479.786917990782538164.stgit@warthog.procyon.org.uk/

2908f5e1 10-Feb-2020 David Howells <dhowells@redhat.com>

fscache: Add a cookie debug ID and use that in traces

Add a cookie debug ID and use that in traces and in procfiles rather than
displaying the (hashed) pointer to the cookie. This is easier to correlate
and we don't lose anything when interpreting oops output since that shows
unhashed addresses and registers that aren't comparable to the hashed
values.

Changes:

ver #2:
- Fix the fscache_op tracepoint to handle a NULL cookie pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/158861210988.340223.11688464116498247790.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465769844.1376105.14119502774019865432.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588459097.3465195.1273313637721852165.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162431193544.2908479.17556704572948300790.stgit@warthog.procyon.org.uk/

bdd3c50d 26-Aug-2021 Christoph Hellwig <hch@lst.de>

dax: remove bdev_dax_supported

All callers already have a dax_device obtained from fs_dax_get_by_bdev
at hand, so just pass that to dax_supported() insted of doing another
lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a384f088 26-Aug-2021 Christoph Hellwig <hch@lst.de>

xfs: factor out a xfs_buftarg_is_dax helper

Refactor the DAX setup code in preparation of removing
bdev_dax_supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6c97ec17 26-Aug-2021 Christoph Hellwig <hch@lst.de>

fsdax: improve the FS_DAX Kconfig description and help text

Rename the main option text to clarify it is for file system access,
and add a bit of text that explains how to actually switch a nvdimm
to a fsdax capable state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210826135510.6293-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

1568cb0e 06-Jul-2021 Johannes Berg <johannes.berg@intel.com>

hostfs: support splice_write

There's really no good reason not to, and e.g. trace-cmd
currently requires it for the temporary per-CPU files.
Hook up splice_write just like everyone else does.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

0bcc7ca4 24-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfsd: fix crash on LOCKT on reexported NFSv3

Unlike other filesystems, NFSv3 tries to use fl_file in the GETLK case.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

bb0a55bb 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfs: don't allow reexport reclaims

In the reexport case, nfsd is currently passing along locks with the
reclaim bit set. The client sends a new lock request, which is granted
if there's currently no conflict--even if it's possible a conflicting
lock could have been briefly held in the interim.

We don't currently have any way to safely grant reclaim, so for now
let's just deny them all.

I'm doing this by passing the reclaim bit to nfs and letting it fail the
call, with the idea that eventually the client might be able to do
something more forgiving here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b840be2f 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

lockd: don't attempt blocking locks on nfs reexports

As in the v4 case, it doesn't work well to block waiting for a lock on
an nfs filesystem.

As in the v4 case, that means we're depending on the client to poll.
It's probably incorrect to depend on that, but I *think* clients do poll
in practice. In any case, it's an improvement over hanging the lockd
thread indefinitely as we currently are.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f657f8ee 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfs: don't atempt blocking locks on nfs reexports

NFS implements blocking locks by blocking inside its lock method. In
the reexport case, this blocks the nfs server thread, which could lead
to deadlocks since an nfs server thread might be required to unlock the
conflicting lock. It also causes a crash, since the nfs server thread
assumes it can free the lock when its lm_notify lock callback is called.

Ideal would be to make the nfs lock method return without blocking in
this case, but for now it works just not to attempt blocking locks. The
difference is just that the original client will have to poll (as it
does in the v4.0 case) instead of getting a callback when the lock's
available.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

97d8cc20 26-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"Two memory management fixes for the filesystem"

* tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client:
ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
ceph: correctly handle releasing an embedded cap flush


9b49ceb8 26-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"One more fix that I think qualifies for a late merge. It's a revert of
a one-liner fix that meanwhile got backported to stable kernels and we
got reports from users.

The broken fix prevents creating compressed inline extents, which
could be noticeable on space consumption.

Technically it's a regression as the patch was merged in 5.14-rc1 but
got propagated to several stable kernels and has higher exposure than
a 'typical' development cycle bug"

* tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: compression: don't try to compress if we don't have enough pages"


03b8df8d 21-Aug-2021 Darrick J. Wong <djwong@kernel.org>

iomap: standardize tracepoint formatting and storage

Print all the offset, pos, and length quantities in hexadecimal. While
we're at it, update the types of the tracepoint structure fields to
match the types of the values being recorded in them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

3998f0b8 25-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: Do not leak EDEADLK to dgetents64 for STATUS_USER_SESSION_DELETED

RHBZ: 1994393

If we hit a STATUS_USER_SESSION_DELETED for the Create part in the
Create/QueryDirectory compound that starts a directory scan
we will leak EDEADLK back to userspace and surprise glibc and the application.

Pick this up initiate_cifs_search() and retry a small number of tries before we
return an error to userspace.

Cc: stable@vger.kernel.org
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

38f4910b 23-Aug-2021 Steve French <stfrench@microsoft.com>

cifs: cifs_md4 convert to SPDX identifier

Add SPDX license identifier and replace license boilerplate
for cifs_md4.c

Signed-off-by: Steve French <stfrench@microsoft.com>

42c21973 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: create a MD4 module and switch cifs.ko to use it

MD4 support will likely be removed from the crypto directory, but
is needed for compression of NTLMSSP in SMB3 mounts.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

71c02863 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: fork arc4 and create a separate module for it for cifs and other users

We can not drop ARC4 and basically destroy CIFS connectivity for
almost all CIFS users so create a new forked ARC4 module that CIFS and other
subsystems that have a hard dependency on ARC4 can use.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

76a3c92e 19-Aug-2021 Ronnie Sahlberg <lsahlber@redhat.com>

cifs: remove support for NTLM and weaker authentication algorithms

for SMB1.
This removes the dependency to DES.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

18d04062 10-Aug-2021 Shyam Prasad N <sprasad@microsoft.com>

cifs: enable fscache usage even for files opened as rw

So far, the fscache implementation we had supports only
a small set of use cases. Particularly for files opened
with O_RDONLY.

This commit enables it even for rw based file opens. It
also enables the reuse of cached data in case of mount
option (cache=singleclient) where it is guaranteed that
this is the only client (and server) which operates on
the files. There's also a single line change in fscache.c
to get around a bug seen in fscache.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

7321be26 23-Aug-2021 Steve French <stfrench@microsoft.com>

smb3: fix posix extensions mount option

We were incorrectly initializing the posix extensions in the
conversion to the new mount API.

CC: <stable@vger.kernel.org> # 5.11+
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

d72c7419 17-Aug-2021 Ding Hui <dinghui@sangfor.com.cn>

cifs: fix wrong release in sess_alloc_buffer() failed path

smb_buf is allocated by small_smb_init_no_tc(), and buf type is
CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to
release it in failed path.

Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>

f980d055 16-Aug-2021 Len Baker <len.baker@gmx.com>

CIFS: Fix a potencially linear read overflow

strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated.

Also, the strnlen() call does not avoid the read overflow in the strlcpy
function when a not NUL-terminated string is passed.

So, replace this block by a call to kstrndup() that avoids this type of
overflow and does the same.

Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions")
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

0c6e1d7f 25-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: don't free request to slab

It's not necessary to free the request back to slab when we fail to
get sqe, just move it to state->free_list.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe67f4dd 24-Aug-2021 Linus Torvalds <torvalds@linux-foundation.org>

pipe: do FASYNC notifications for every pipe IO, not just state changes

It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.

Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable). User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.

But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).

The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.

This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong. What matters is that it used to work.

So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes. It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.

Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed. Probably because the stress-ng problem case ends up
being timing-dependent too.

It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").

And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.

Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case. FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.

Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a9e6ffbc 05-Aug-2021 Tuo Li <islituo@gmail.com>

ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()

kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage
only kfree(m->m_info) if it's non-NULL ]

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b2f9fa1f 18-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: correctly handle releasing an embedded cap flush

The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

18598195 03-Jun-2021 David Howells <dhowells@redhat.com>

cachefiles: Use file_inode() rather than accessing ->f_inode

Use the file_inode() helper rather than accessing ->f_inode directly.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/

a7e20e31 12-May-2021 David Howells <dhowells@redhat.com>

netfs: Move cookie debug ID to struct netfs_cache_resources

Move the cookie debug ID from struct netfs_read_request to struct
netfs_cache_resources and drop the 'cookie_' prefix. This makes it
available for things that want to use netfs_cache_resources without having
a netfs_read_request.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/

4c5e4139 04-Jun-2021 David Howells <dhowells@redhat.com>

fscache: Select netfs stats if fscache stats are enabled

Unconditionally select the stats produced by the netfs lib if fscache stats
are enabled as the former are displayed in the latter's procfile.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/

1266b4a7 25-Aug-2021 Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: fix double free of 'copied'

Dan reported a new smatch warning [1]
"fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'"

Due to new chunk-based format handling logic, the error path can be
called after kfree(copied).

Set "copied = NULL" after kfree(copied) to fix this.

[1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com

Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com
Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

4e965576 24-Aug-2021 Qu Wenruo <wqu@suse.com>

Revert "btrfs: compression: don't try to compress if we don't have enough pages"

This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.

[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627319f ("btrfs: compression: don't try to compress if we don't
have enough pages").

[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.

- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.

- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.

For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.

[FIX]
Fix it by reverting the offending commit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

aaa4db12 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: accept directly into fixed file table

As done with open opcodes, allow accept to skip installing fd into
processes' file tables and put it directly into io_uring's fixed file
table. Same restrictions and design as for open.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a7083ad5 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: hand code io_accept() fd installing

Make io_accept() to handle file descriptor allocations and installation.
A preparation patch for bypassing file tables.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9445598 24-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: openat directly into fixed fd table

Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table.
This patch adds such a capability for IORING_OP_OPENAT and
IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour using normal file tables. If non-zero value is
specified, then it will behave as described and place the file into a
fixed file slot sqe->file_index - 1. A file table should be already
created, the slot should be valid and empty, otherwise the operation
will fail.

Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and
EINVAL on inappropriate fixed tables, and return EBADF on collision with
already registered file.

Note: IOSQE_FIXED_FILE can't be used to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c42dd069 24-Aug-2021 Sishuai Gong <sishuai@purdue.edu>

configfs: fix a race in configfs_lookup()

When configfs_lookup() is executing list_for_each_entry(),
it is possible that configfs_dir_lseek() is calling list_del().
Some unfortunate interleavings of them can cause a kernel NULL
pointer dereference error

Thread 1 Thread 2
//configfs_dir_lseek() //configfs_lookup()
list_del(&cursor->s_sibling);
list_for_each_entry(sd, ...)

Fix this by grabbing configfs_dirent_lock in configfs_lookup()
while iterating ->s_children.

Signed-off-by: Sishuai Gong <sishuai@purdue.edu>
Signed-off-by: Christoph Hellwig <hch@lst.de>

d07f132a 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: fold configfs_attach_attr into configfs_lookup

This makes it more clear what gets added to the dcache and prepares
for an additional locking fix.

Signed-off-by: Christoph Hellwig <hch@lst.de>

899587c8 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: simplify the configfs_dirent_is_ready

Return the error directly instead of using a goto.

Signed-off-by: Christoph Hellwig <hch@lst.de>

417b962d 24-Aug-2021 Christoph Hellwig <hch@lst.de>

configfs: return -ENAMETOOLONG earlier in configfs_lookup

Just like most other file systems: get the simple checks out of the
way first.

Signed-off-by: Christoph Hellwig <hch@lst.de>

f38a032b 24-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: fix I_DONTCACHE

Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads
make it clear that it doesn't work as it should.

Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

158ee7b6 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: mark blkdev_fsync static

blkdev_fsync is only used inside of block_dev.c since the
removal of the raw drіver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824151823.1575100-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2949e842 24-Aug-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

fs: clean up after mandatory file locking support removal

Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes
some operations in functions rw_verify_area().

As these functions are now simplified, do some syntactic clean-up as
follow-up to the removal as well, which was pointed out by compiler
warnings and static analysis.

No functional change.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

72a048c1 20-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: only set IOMAP_F_SHARED when providing a srcmap to a write

While prototyping a free space defragmentation tool, I observed an
unexpected IO error while running a sequence of commands that can be
recreated by the following sequence of commands:

# xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
# cp --reflink=always file1 file2
# punch-alternating -o 1 file2
# xfs_io -c "funshare 0 10m" file2
fallocate: Input/output error

I then scraped this (abbreviated) stack trace from dmesg:

WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:iomap_write_begin+0x376/0x450
RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
Call Trace:
iomap_unshare_actor+0x95/0x140
iomap_apply+0xfa/0x300
iomap_file_unshare+0x44/0x60
xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
vfs_fallocate+0x133/0x330
__x64_sys_fallocate+0x3e/0x70
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4b8f79140a

Looking at the iomap tracepoints, I saw this:

iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450

The first time funshare calls ->iomap_begin, xfs sees that the first
block is shared and creates a 128k delalloc reservation in the COW fork.
The delalloc reservation is returned as dstmap, and the shared block is
returned as srcmap. So far so good.

funshare calls ->iomap_begin to try the second block. This time there's
no srcmap (punch-alternating punched it out!) but we still have the
delalloc reservation in the COW fork. Therefore, we again return the
reservation as dstmap and the hole as srcmap. iomap_unshare_iter
incorrectly tries to unshare the hole, which __iomap_write_begin rejects
because shared regions must be fully written and therefore cannot
require zeroing.

Therefore, change the buffered write iomap_begin function not to set
IOMAP_F_SHARED when there isn't a source mapping to read from for the
unsharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

7f024fcd 23-Aug-2021 J. Bruce Fields <bfields@redhat.com>

Keep read and write fds with each nlm_file

We shouldn't really be using a read-only file descriptor to take a write
lock.

Most filesystems will put up with it. But NFS, for example, won't.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

cf30da90 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_LINKAT

IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7a8721f8 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_SYMLINKAT

IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

01cfa28a 12-Aug-2021 Christoph Hellwig <hch@lst.de>

block: use the percpu bio cache in __blkdev_direct_IO

Use bio_alloc_kiocb to dip into the percpu cache of bios when the
caller asks for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

394918eb 08-Mar-2021 Jens Axboe <axboe@kernel.dk>

io_uring: enable use of bio alloc cache

Mark polled IO as being safe for dipping into the bio allocation
cache, in case the targeted bio_set has it enabled.

This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to
~3.5M IOPS.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dadebc35 23-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix io_try_cancel_userdata race for iowq

WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.

note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel left pending
requests in such a way.

Cc: stable@vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e34a02dc 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

io_uring: add support for IORING_OP_MKDIRAT

IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

45f30dab 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: update do_*() helpers to return ints

Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:

* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

020250f3 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_linkat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with do_renameat2, do_unlinkat, do_mknodat, etc.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8228e2c3 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: add getname_uflags()

There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future. Let's just add a simple helper getname_uflags() that handles
this directly and use it.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da2d0ced 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_symlinkat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_mkdnodat(), do_unlinkat(),
do_renameat(), do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7797251b 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_mknodat() take struct filename

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_unlinkat(), do_renameat(),
do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

584d3226 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: make do_mkdirat() take struct filename

Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename").

This behaves like do_unlinkat() and do_renameat2().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0ee50b47 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: change filename_parentat() calling conventions

Since commit 5c31b6cedb675 ("namei: saner calling conventions for
filename_parentat()") filename_parentat() had the following behavior WRT
the passed in struct filename *:

* On error the name is consumed (putname() is called on it);
* On success the name is returned back as the return value;

Now there is a need for filename_create() and filename_lookup() variants
that do not consume the passed filename, and following the same "consume
the name only on error" semantics is proven to be hard to reason about
and result in confusing code.

Hence this preparation change splits filename_parentat() into two: one
that always consumes the name and another that never consumes the name.
This will allow to implement two filename_create() variants in the same
way, and is a consistent and hopefully easier to reason about approach.

Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91ef658f 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: ignore ERR/NULL names in putname()

Supporting ERR/NULL names in putname() makes callers code cleaner, and
is what some other path walking functions already support for the same
reason.

This also removes a few existing IS_ERR checks before putname().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

126180b9 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: IRQ rw completion batching

Employ inline completion logic for read/write completions done via
io_req_task_complete(). If ->uring_lock is contended, just do normal
request completion, but if not, make tctx_task_work() to grab the lock
and do batched inline completions in io_req_task_complete().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f237c30a 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: batch task work locking

Many task_work handlers either grab ->uring_lock, or may benefit from
having it. Move locking logic out of individual handlers to a lazy
approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5636c00d 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: flush completions for fallbacks

io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to flush
completions preparing for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26578cda 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add ->splice_fd_in checks

->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most of request types excluding
reads/writes/sends/recvs, we don't want overhead for them and can leave
them be as is until the field is actually used.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2c5d763c 21-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: add clarifying comment for io_cqring_ev_posted()

We've previously had an issue where overflow flush unconditionally calls
io_cqring_ev_posted() even if it didn't flush any events to the ring,
causing wake and eventfd increment where no new events are available.
Some applications don't like that, see commit b18032bb0a88 for details.

This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.

Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0bea96f5 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: place fixed tables under memcg limits

Fixed tables may be large enough, place all of them together with
allocated tags under memcg limits.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a1b8a4e 20-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: limit fixed table size by RLIMIT_NOFILE

Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE,
that's the first and the simpliest restriction that we should impose.

Cc: stable@vger.kernel.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

99c8bc52 20-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: fix lack of protection for compl_nr

coml_nr in ctx_flush_and_put() is not protected by uring_lock, this
may cause problems when accessing in parallel:

say coml_nr > 0

ctx_flush_and put other context
if (compl_nr) get mutex
coml_nr > 0
do flush
coml_nr = 0
release mutex
get mutex
do flush (*)
release mutex

in (*) place, we call io_cqring_ev_posted() and users likely get
no events there. To avoid spurious events, re-check the value when
under the lock.

Fixes: 2c32395d8111 ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

187f08c1 18-Aug-2021 wangyangbo <wangyangbo@uniontech.com>

io_uring: Add register support for non-4k PAGE_SIZE

Now allocated rsrc table uses PAGE_SIZE as the size of 2nd-level, and
accessing this table relies on each level index from fixed TABLE_SHIFT
(12 - 3) in 4k page case. In order to correctly work in non-4k page,
define TABLE_SHIFT as non-fixed (PAGE_SHIFT - shift of data) for
2nd-level table entry number.

Signed-off-by: wangyangbo <wangyangbo@uniontech.com>
Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e98e49b2 18-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: extend task put optimisations

Now with IRQ completions done via IRQ, almost all requests freeing
are done from the context of submitter task, so it makes sense to
extend task_put optimisation from io_req_free_batch_finish() to cover
all the cases including task_work by moving it into io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

316319e8 19-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: add comments on why PF_EXITING checking is safe

We have two checks of task->flags & PF_EXITING left:

1) In io_req_task_submit(), which is called in task_work and hence always
in the context of the original task. That means that
req->task == current, and hence checking ->flags is totally fine.

2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
it interfering with cancelation. This is only run from task_work as
well, and hence for this case too req->task == current.

Add a comment to both spots detailing that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

79dca184 10-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io-wq: move nr_running and worker_refs out of wqe->lock protection

We don't need to protect nr_running and worker_refs by wqe->lock, so
narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ec3c3d0f 18-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: fix io_timeout_remove locking

io_timeout_cancel() posts CQEs so needs ->completion_lock to be held,
so grab it in io_timeout_remove().

Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

23a65db8 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: improve same wq polling

Move earlier the check for whether __io_queue_proc() tries to poll
already polled waitqueue, and do the same for the second poll entry, if
any. Shouldn't really matter, but at least it would have a more
predictable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

505657bc 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: reuse io_req_complete_post()

We have io_req_complete_post() to post a CQE and put the request. It
takes care of all synchronisation and is more concise and efficent, so
replace all hancoded occurrences of
"lock; post CQE; unlock; + put_req()" with io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ae421d93 17-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: better encapsulate buffer select for rw

Make io_put_rw_kbuf() to do the REQ_F_BUFFER_SELECTED check, so all the
callers don't need to hand code it. The number of places where we call
io_put_rw_kbuf() is growing, so saves some pain.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

906c6caa 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise io_prep_linked_timeout()

Linked timeout handling during issuing is heavy, it adds extra
instructions and forces to save the next linked timeout before
io_issue_sqe().

Follwing the same reasoning as in refcounting patches, a request can't
be freed by the time it returns from io_issue_sqe(), so now we don't
need to do io_prep_linked_timeout() in advance, and it can be delayed to
colder paths optimising the generic path.

Also, it should also save quite a lot for requests with linked timeouts
and completed inline on timeout spinlocking + hrtimer_start() +
hrtimer_try_to_cancel() and so on.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0756a869 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: cancel not-armed linked touts separately

Adjust io_disarm_next(), so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used in
the following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d13d1a4 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: simplify io_prep_linked_timeout

The link test in io_prep_linked_timeout() is pretty bulky, replace it
with a flag. It's better for normal path and linked requests, and also
will be used further for request failing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b97e736a 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill REQ_F_LTIMEOUT_ACTIVE

Instead of handling double consecutive linked timeouts through tricky
flag combinations, just check the submit_state.link during timeout_prep
and fail that case in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fd08e530 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise hot path of ltimeout prep

io_prep_linked_timeout() grew too heavy and compiler now refuse to
inline the function. Help it by splitting in two and annotating with
inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8cb01fac 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: deduplicate cancellation code

IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough of
overlap, so extract a helper for request cancellation and use in both.
Also, removes some amount of ugliness because of success_ret.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8576af9 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill not necessary resubmit switch

773af69121ecc ("io_uring: always reissue from task_work context") makes
all resubmission to be made from task_work, so we don't need that hack
with resubmit/not-resubmit switch anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fb682099 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise initial ltimeout refcounting

Linked timeouts are never refcounted when it comes to the first call to
__io_prep_linked_timeout(), so save an io_ref_get() and set the desired
value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

761bcac1 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't inflight-track linked timeouts

Tracking linked timeouts as infligh was needed to make sure that io-wq
is not destroyed by io_uring_cancel_generic() racing with
io_async_cancel_one() accessing it. Now, cancellations issued by linked
timeouts are done in the task context, so it's already synchronised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

48dcd38d 15-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise iowq refcounting

If a requests is forwarded into io-wq, there is a good chance it hasn't
been refcounted yet and we can save one req_ref_get() by setting the
refcount number to the right value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a141dd89 12-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: correct __must_hold annotation

io_req_free_batch() has a __must_hold annotation referencing a
request being passed in, but we're passing in the context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

41a5169c 12-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: code clean for completion_lock in io_arm_poll_handler()

We can merge two spin_unlock() operations to one since we removed some
code not long ago.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f552a27a 11-Aug-2021 Hao Xu <haoxu@linux.alibaba.com>

io_uring: remove files pointer in cancellation functions

When doing cancellation, we use a parameter to indicate where it's from
do_exit or exec. So a boolean value is good enough for this, remove the
struct files* as it is not necessary.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

20e60a38 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: skip request refcounting

As submission references are gone, there is only one initial reference
left. Instead of actually doing atomic refcounting, add a flag
indicating whether we're going to take more refs or doing any other sync
magic. The flag should be set before the request may get used in
parallel.

Together with the previous patch it saves 2 refcount atomics per request
for IOPOLL and IRQ completions, and 1 atomic per req for inline
completions, with some exceptions. In particular, currently, there are
three cases, when the refcounting have to be enabled:
- Polling, including apoll. Because double poll entries takes a ref.
Might get relaxed in the near future.
- Link timeouts, enabled for both, the timeout and the request it's
bound to, because they work in-parallel and we need to synchronise
to cancel one of them on completion.
- When a request gets in io-wq, because it doesn't hold uring_lock and
we need guarantees of submission references.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d5901a3 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove submission references

Requests are by default given with two references, submission and
completion. Completion references are straightforward, they represent
request ownership and are put when a request is completed or so.
Submission references are a bit more trickier. They're needed when
io_issue_sqe() followed deep into the submission stack (e.g. in fs,
block, drivers, etc.), request may have given away for concurrent
execution or already completed, and the code unwinding back to
io_issue_sqe() may be accessing some pieces of our requests, e.g.
file or iov.

Now, we prevent such async/in-depth completions by pushing requests
through task_work. Punting to io-wq is also done through task_works,
apart from a couple of cases with a pretty well known context. So,
there're two cases:
1) io_issue_sqe() from the task context and protected by ->uring_lock.
Either requests return back to io_uring or handed to task_work, which
won't be executed because we're currently controlling that task. So,
we can be sure that requests are staying alive all the time and we don't
need submission references to pin them.

2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of
submission reference is played by io-wq reference, which is put by
io_wq_submit_work(). Hence, it should be fine.

Considering that, we can carefully kill the submission reference.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91c2f697 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove req_ref_sub_and_test()

Soon, we won't need to put several references at once, remove
req_ref_sub_and_test() and @nr argument from io_put_req_deferred(),
and put the rest of the references by hand.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

21c843d5 11-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move req_ref_get() and friends

Move all request refcount helpers to avoid forward declarations in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

79ebeaee 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: remove IRQ aspect of io_ring_ctx completion lock

We have no hard/soft IRQ users of this lock left, remove any IRQ
disabling/saving and restoring when grabbing this lock.

This is straight forward with no users entering with IRQs disabled
anymore, the only thing to look out for is the waitqueue poll head
lock which nests inside the completion lock. That needs IRQs disabled,
and hence we have to do that now instead of relying on the outer lock
doing so.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

8ef12efe 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run regular file completions from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

89b263f6 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run linked timeouts from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

89850fce 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: run timeouts from task_work

This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Add a timeout_lock to handle the ordering of timeout completions or
cancelations with the timeouts actually triggering.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

62906e89 10-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove file batch-get optimisation

For requests with non-fixed files, instead of grabbing just one
reference, we get by the number of left requests, so the following
requests using the same file can take it without atomics.

However, it's not all win. If there is one request in the middle
not using files or having a fixed file, we'll need to put back the left
references. Even worse if an application submits requests dealing with
different files, it will do a put for each new request, so doubling the
number of atomics needed. Also, even if not used, it's still takes some
cycles in the submission path.

If a file used many times, it rather makes sense to pre-register it, if
not, we may fall in the described pitfall. So, this optimisation is a
matter of use case. Go with the simpliest code-wise way, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6294f368 10-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clean up tctx_task_work()

After recent fixes, tctx_task_work() always does proper spinlocking
before looking into ->task_list, so now we don't need atomics for
->task_state, replace it with non-atomic task_running using the critical
section.

Tide it up, combine two separate block with spinlocking, and always try
to splice in there, so we do less locking when new requests are arriving
during the function execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d709043 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline io_poll_remove_waitqs

Inline io_poll_remove_waitqs() into its only user and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90f67366 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove extra argument for overflow flush

Unlike __io_cqring_overflow_flush(), nobody does forced flushing with
io_cqring_overflow_flush(), so removed the argument from it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd0ca2e0 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline struct io_comp_state

Inline struct io_comp_state into struct io_submit_state. They are
already coupled tightly, together with mixed responsibilities it
only brings confusion having them separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bb943b82 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: use inflight_entry instead of compl.list

req->compl.list is used to cache freed requests, and so can't overlap in
time with req->inflight_entry. So, use inflight_entry to link requests
and remove compl.list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7255834e 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove redundant args from cache_free

We don't use @tsk argument of io_req_cache_free(), remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c34b025f 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: cache __io_free_req()'d requests

Don't kfree requests in __io_free_req() but put them back into the
internal request cache. That makes allocations more sustainable and will
be used for refcounting optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f56165e6 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_fallback_req_func()

Move io_fallback_req_func() to kill yet another forward declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e9dbe221 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise putting task struct

We cache all the reference to task + tctx, so if io_put_task() is
called by the corresponding task itself, we can save on atomics and
return the refs right back into the cache.

It's beneficial for all inline completions, and also iopolling, when
polling and submissions are done by the same task, including
SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well,
so those should be flushed in the loop for tctx_inflight() to work
right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

af066f31 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: drop exec checks from io_req_task_submit

In case of on-exec io_uring cancellations, tasks already wait for all
submitted requests to get completed/cancelled, so we don't need to check
for ->in_execve separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bbbca094 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: kill unused IO_IOPOLL_BATCH

IO_IOPOLL_BATCH is not used, delete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

58d3be2c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: improve ctx hang handling

If io_ring_exit_work() can't get it done in 5 minutes, something is
going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help
and it may take much of CPU time if there is a lot of workers stuck as
such.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3fddf6d 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: deduplicate open iopoll check

Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat
and openat2 reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

543af3a1 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline io_free_req_deferred

Inline io_free_req_deferred(), there is no reason to keep it separated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9bd2bea 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_rsrc_node_alloc() definition

Move the function together with io_rsrc_node_ref_zero() in the source
file as it is to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6a290a14 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: move io_put_task() definition

Move the function in the source file as it is to get rid of forward
declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e73c5c7c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: extract a helper for ctx quiesce

Refactor __io_uring_register() by extracting a helper responsible for
ctx queisce. Looks better and will make it easier to add more
optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90291099 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: optimise io_cqring_wait() hot path

Turns out we always init struct io_wait_queue in io_cqring_wait(), even
if it's not used after, i.e. there are already enough of CQEs. And often
it's exactly what happens, for instance, requests may have been
completed inline, or in case of io_uring_enter(submit=N, wait=1).

It shows up in my profiler, so optimise it by delaying the struct init.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

282cdc86 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: add more locking annotations for submit

Add more annotations for submission path functions holding ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a2416e1e 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: don't halt iopoll too early

IOPOLL users should care more about getting completions for requests
they submitted, but not in "device did/completed something". Currently,
io_do_iopoll() may return a positive number, which will instruct
io_iopoll_check() to break the loop and end the syscall, even if there
is not enough CQEs or none at all.

Don't return positive numbers, so io_iopoll_check() exits only when it
gets an actual error, need reschedule or got enough CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

864ea921 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: refactor io_alloc_req

Replace the main if of io_flush_cached_reqs() with inverted condition +
goto, so all the cases are handled in the same way. And also extract
io_preinit_req() to make it cleaner and easier to refer to.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8724dd8c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io-wq: improve wq_list_add_tail()

Prepare nodes that we're going to add before actually linking them, it's
always safer and costs us nothing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2215bed9 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: remove unnecessary PF_EXITING check

We prefer nornal task_works even if it would fail requests inside. Kill
a PF_EXITING check in io_req_task_work_add(), task_work_add() handles
well dying tasks, i.e. return error when can't enqueue due to late
stages of do_exit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ebc11b6c 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: clean io-wq callbacks

Move io-wq callbacks closer to each other, so it's easier to work with
them, and rename io_free_work() into io_wq_free_work() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c97d8a0f 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: avoid touching inode in rw prep

If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set.
However, for non-reg files io_prep_rw() still will look into inode to
double check, and that's expensive and can be avoided.

The only caveat is that it only currently works with 64+ bit
architectures, see FFS_ISREG, so we should consider that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b191e2df 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: rename io_file_supports_async()

io_file_supports_async() checks whether a file supports nowait
operations, so "async" in the name is misleading. Rename it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac177053 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: inline fixed part of io_file_get()

Optimise io_file_get() with registered files, which is in a hot path,
by inlining parts of the function. Saves a function call, and
inefficiencies of passing arguments, e.g. evaluating
(sqe_flags & IOSQE_FIXED_FILE).

It couldn't have been done before as compilers were refusing to inline
it because of the function size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

042b0d85 09-Aug-2021 Pavel Begunkov <asml.silence@gmail.com>

io_uring: use kvmalloc for fixed files

Instead of hand-coded two-level tables for registered files, allocate
them with kvmalloc(). In many cases small enough tables are enough, and
so can be kmalloc()'ed removing an extra memory load and a bunch of bit
logic instructions from the hot path. If the table is larger, we trade
off all the pros with a TLB-assisted memory lookup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5fd46178 06-Aug-2021 Jens Axboe <axboe@kernel.dk>

io_uring: be smarter about waking multiple CQ ring waiters

Currently we only wake the first waiter, even if we have enough entries
posted to satisfy multiple waiters. Improve that situation so that
every waiter knows how much the CQ tail has to advance before they can
be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get
4 completions, then we wake up 4 waiters. If we have N waiters asking
for 2 completions and we get 4 completions, then we wake up the first
two. Previously, only the first waiter would've been woken up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3e9f732 04-Aug-2021 Jens Axboe <axboe@kernel.dk>

io-wq: remove GFP_ATOMIC allocation off schedule out path

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [ 90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [ 90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [ 90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G W 5.14.0-rc4-rt4+ #89
| [ 90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [ 90.202561] Call Trace:
| [ 90.202577] dump_stack_lvl+0x34/0x44
| [ 90.202584] ___might_sleep.cold+0x87/0x94
| [ 90.202588] rt_spin_lock+0x19/0x70
| [ 90.202593] ___slab_alloc+0xcb/0x7d0
| [ 90.202598] ? newidle_balance.constprop.0+0xf5/0x3b0
| [ 90.202603] ? dequeue_entity+0xc3/0x290
| [ 90.202605] ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202610] ? pick_next_task_fair+0xb9/0x330
| [ 90.202612] ? __schedule+0x670/0x1410
| [ 90.202615] ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202618] kmem_cache_alloc_trace+0x79/0x1f0
| [ 90.202621] io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202625] io_wq_worker_sleeping+0x37/0x50
| [ 90.202628] schedule+0x30/0xd0
| [ 90.202630] schedule_timeout+0x8f/0x1a0
| [ 90.202634] ? __bpf_trace_tick_stop+0x10/0x10
| [ 90.202637] io_wqe_worker+0xfd/0x320
| [ 90.202641] ? finish_task_switch.isra.0+0xd3/0x290
| [ 90.202644] ? io_worker_handle_work+0x670/0x670
| [ 90.202646] ? io_worker_handle_work+0x670/0x670
| [ 90.202649] ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between to sleep cycles.

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94c821fb 20-Aug-2021 Chao Yu <chao@kernel.org>

f2fs: rebuild nat_bits during umount

If all free_nat_bitmap are available, we can rebuild nat_bits from
free_nat_bitmap entirely during umount, let's make another chance
to reenable nat_bits for image.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

a4b68176 20-Aug-2021 Daeho Jeong <daehojeong@google.com>

f2fs: introduce periodic iostat io latency traces

Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.

So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.

[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

52118743 19-Aug-2021 Daeho Jeong <daehojeong@google.com>

f2fs: separate out iostat feature

Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b661601a 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

lockd: update nlm_lookup_file reexport comment

Update comment to reflect that we *do* allow reexport, whether it's a
good idea or not....

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a81041b7 23-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nlm: minor refactoring

Make this lookup slightly more concise, and prepare for changing how we
look this up in a following patch.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2dc6f19e 22-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nlm: minor nlm_lookup_file argument change

It'll come in handy to get the whole nlm_lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

11e4e66e 23-Aug-2021 aalexandrovich <88376726+aalexandrovich@users.noreply.github.com>

Merge branch 'torvalds:master' into master


0d977e0e 20-Aug-2021 Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

btrfs: reset replace target device to allocation state on close

This crash was observed with a failed assertion on device close:

BTRFS: Transaction aborted (error -28)
WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
FS: 0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
Call Trace:
flush_space+0x197/0x2f0 [btrfs]
btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
process_one_work+0x262/0x5e0
worker_thread+0x4c/0x320
? process_one_work+0x5e0/0x5e0
kthread+0x144/0x170
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
irq event stamp: 19334989
hardirqs last enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
softirqs last enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
---[ end trace 45939e308e0dd3c7 ]---
BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
BTRFS info (device vdd): forced readonly
BTRFS warning (device vdd): failed setting block group ro: -30
BTRFS info (device vdd): suspending dev_replace for unmount
assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 3982 Comm: umount Tainted: G W 5.14.0-rc5-default+ #1532
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
FS: 00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
Call Trace:
btrfs_close_one_device.cold+0x11/0x55 [btrfs]
close_fs_devices+0x44/0xb0 [btrfs]
btrfs_close_devices+0x48/0x160 [btrfs]
generic_shutdown_super+0x69/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x2c/0xa0
cleanup_mnt+0x144/0x1b0
task_work_run+0x59/0xa0
exit_to_user_mode_loop+0xe7/0xf0
exit_to_user_mode_prepare+0xaf/0xf0
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x4a/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:

close_ctree():
btrfs_dev_replace_suspend_for_unmount();
btrfs_close_devices():
btrfs_close_fs_devices():
btrfs_close_one_device():
ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
&device->dev_state));

However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.

To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

58bc6d1b 22-Aug-2021 Stian Skjelstad <stian.skjelstad@gmail.com>

udf_get_extendedattr() had no boundary checks.

When parsing the ExtendedAttr data, malicous or corrupt attribute length
could cause kernel hangs and buffer overruns in some special cases.

Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com
Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

939c7feb 11-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix ordered extent boundary calculation

btrfs_lookup_ordered_extent() is supposed to query the offset in a file
instead of the logical address. Pass the file offset from
submit_extent_page() to calc_bio_boundaries().

Also, calc_bio_boundaries() relies on the bio's operation flag, so move
the call site after setting it.

Fixes: 390ed29b817e ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

11462397 11-Aug-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: do not do preemptive flushing if the majority is global rsv

A common characteristic of the bug report where preemptive flushing was
going full tilt was the fact that the vast majority of the free metadata
space was used up by the global reserve. The hard 90% threshold would
cover the majority of these cases, but to be even smarter we should take
into account how much of the outstanding reservations are covered by the
global block reserve. If the global block reserve accounts for the vast
majority of outstanding reservations, skip preemptive flushing, as it
will likely just cause churn and pain.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

93c60b17 11-Aug-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: reduce the preemptive flushing threshold to 90%

The preemptive flushing code was added in order to avoid needing to
synchronously wait for ENOSPC flushing to recover space. Once we're
almost full however we can essentially flush constantly. We were using
98% as a threshold to determine if we were simply full, however in
practice this is a really high bar to hit. For example reports of
systems running into this problem had around 94% usage and thus
continued to flush. Fix this by lowering the threshold to 90%, which is
a more sane value, especially for smaller file systems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
CC: stable@vger.kernel.org # 5.12+
Fixes: 576fa34830af ("btrfs: improve preemptive background space flushing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3736127a 02-Aug-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: tree-log: check btrfs_lookup_data_extent return value

Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.

Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.

Fix the condition by checking if the value was bellow zero and return
early.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8be2ba2e 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: avoid unnecessarily logging directories that had no changes

There are several cases where when logging an inode we need to log its
parent directories or logging subdirectories when logging a directory.

There are cases however where we end up logging a directory even if it was
not changed in the current transaction, no dentries added or removed since
the last transaction. While this is harmless from a functional point of
view, it is a waste time as it brings no advantage.

One example where this is triggered is the following:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt

$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C

$ touch /mnt/A/foo
$ ln /mnt/A/foo /mnt/B/bar
$ ln /mnt/A/foo /mnt/C/baz

$ sync

$ rm -f /mnt/A/foo
$ xfs_io -c "fsync" /mnt/B/bar

This last fsync ends up logging directories A, B and C, however we only
need to log directory A, as B and C were not changed since the last
transaction commit.

So fix this by changing need_log_inode(), to return false in case the
given inode is a directory and has a ->last_trans value smaller than the
current transaction's ID.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5b9b26f5 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mount

Now that we converted btrfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag. We only need to raise this flag on the btrfs_root_fs_type
filesystem since btrfs_mount_root() is ultimately responsible for
allocating the superblock and is called into from btrfs_mount()
associated with btrfs_fs_type.

The conversion of the btrfs inode operations was straightforward.
Regarding btrfs specific ioctls that perform checks based on inode
permissions only those have been allowed that are not filesystem wide
operations and hence can be reasonably charged against a specific mount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4a8b34af 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: handle ACLs on idmapped mounts

Make the ACL code idmapped mount aware. The POSIX default and POSIX
access ACLs are the only ACLs other than some specific xattrs that take
DAC permissions into account. On an idmapped mount they need to be
translated according to the mount's userns. The main change is done to
__btrfs_set_acl() which is responsible for translating POSIX ACLs to
their final on-disk representation.

The btrfs_init_acl() helper does not need to take the idmapped mount
into account since it is called in the context of file creation
operations (mknod, create, mkdir, symlink, tmpfile) and is used for
btrfs_init_inode_security() to copy POSIX default and POSIX access
permissions from the parent directory. These ACLs need to be inherited
unmodified from the parent directory. This is identical to what we do
for ext4 and xfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6623d9a0 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped INO_LOOKUP_USER ioctl

The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
and has the following restrictions. The main difference between the two
is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
is scoped beneath the file descriptor passed with the ioctl.
Specifically, INO_LOOKUP_USER must adhere to the following restrictions:

- The caller must be privileged over each inode of each path component
for the path they are trying to lookup.

- The path for the subvolume the caller is trying to lookup must be reachable
from the inode associated with the file descriptor passed with the ioctl.

The second condition makes it possible to scope the lookup of the path
to the mount identified by the file descriptor passed with the ioctl.
This allows us to enable this ioctl on idmapped mounts.

Specifically, this is possible because all child subvolumes of a parent
subvolume are reachable when the parent subvolume is mounted. So if the
user had access to open the parent subvolume or has been given the fd
then they can lookup the path if they had access to it provided they
were privileged over each path component.

Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
of a subvolume even though they would otherwise be restricted from doing
so via regular VFS-based lookup.

So think about a parent subvolume with multiple child subvolumes.
Someone could mount he parent subvolume and restrict access to the child
subvolumes by overmounting them with empty directories. At this point
the user can't traverse the child subvolumes and they can't open files
in the child subvolumes. However, they can still learn the path of
child subvolumes as long as they have access to the parent subvolume by
using the INO_LOOKUP_USER ioctl.

The underlying assumption here is that it's ok that the lookup ioctls
can't really take mounts into account other than the original mount the
fd belongs to during lookup. Since this assumption is baked into the
original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

39e1674f 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SUBVOL_SETFLAGS ioctl

Setting flags on subvolumes or snapshots are core features of btrfs. The
SUBVOL_SETFLAGS ioctl is especially important as it allows to make
subvolumes and snapshots read-only or read-write. Allow setting flags on
btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e4fed17a 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls

The SET_RECEIVED_SUBVOL ioctls are used to set information about
a received subvolume. Make it possible to set information about a
received subvolume on idmapped mounts. This is a fairly straightforward
operation since all the permission checking helpers are already capable
of handling idmapped mounts. So we just need to pass down the mount's
userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

aabb34e7 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids

So far we prevented the deletion of subvolumes and snapshots using
subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.

This restriction is necessary on idmapped mounts as this allows
filesystem wide subvolume and snapshot deletions and thus can escape the
scope of what's exposed under the mount identified by the fd passed with
the ioctl.

Deletion by subvolume id works by looking for an alias of the parent of
the subvolume or snapshot to be deleted. The parent alias can be
anywhere in the filesystem. However, as long as the alias of the parent
that is found is the same as the one identified by the file descriptor
passed through the ioctl we can allow the deletion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c4ed533b 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SNAP_DESTROY ioctls

Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.

This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.

This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.

The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.

This isn't a huge problem a root user wanting to delete arbitrary
subvolumes can just always create another (even detached) mount without
an idmapping attached.

In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is directly located under inode referenced by the fd
passed for the ioctl() in a follow-up commit.

Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be delete but it
can still be deleted by using the subvolume id:

/* Compile the following program as "delete_by_spec". */

#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
struct btrfs_ioctl_vol_args_v2 args = {};
int ret;

args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
args.subvolid = subvolid;

ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
if (ret < 0)
return -1;

return 0;
}

int main(int argc, char *argv[])
{
int subvolid = 0;

if (argc < 3)
exit(1);

fprintf(stderr, "Opening %s\n", argv[1]);
int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
if (fd < 0)
exit(2);

subvolid = atoi(argv[2]);

fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
int ret = rm_subvolume_by_id(fd, subvolid);
if (ret < 0)
exit(3);

exit(0);
}
#include <stdio.h>"
#include <stdlib.h>"
#include <linux/btrfs.h"

truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}

With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.

The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4d4340c9 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls

Creating subvolumes and snapshots is one of the core features of btrfs
and is even available to unprivileged users. Make it possible to use
subvolume and snapshot creation on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5474bf40 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: check whether fsgid/fsuid are mapped during subvolume creation

When a new subvolume is created btrfs currently doesn't check whether
the fsgid/fsuid of the caller actually have a mapping in the user
namespace attached to the filesystem. The VFS always checks this to make
sure that the caller's fsgid/fsuid can be represented on-disk. This is
most relevant for filesystems that can be mounted inside user namespaces
but it is in general a good hardening measure to prevent unrepresentable
gid/uid from being written to disk.

Since we want to support idmapped mounts for btrfs ioctls to create
subvolumes in follow-up patches this becomes important since we want to
make sure the fsgid/fsuid of the caller as mapped according to the
idmapped mount can be represented on-disk. Simply add the missing
fsuidgid_has_mapping() line from the VFS may_create() version to
btrfs_may_create().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3bc71ba0 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped permission inode op

Enable btrfs_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d4d09464 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped setattr inode op

Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

98b6ab5f 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped tmpfile inode op

Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5a052108 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped symlink inode op

Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b0b3e44d 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mkdir inode op

Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e93ca491 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped create inode op

Enable btrfs_create() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

72105277 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped mknod inode op

Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c020d2ea 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped getattr inode op

Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ca07274c 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: allow idmapped rename inode op

Enable btrfs_rename() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b3b6f5b9 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

btrfs: handle idmaps in btrfs_new_inode()

Extend btrfs_new_inode() to take the idmapped mount into account when
initializing a new inode. This is just a matter of passing down the
mount's userns. The rest is taken care of in inode_init_owner(). This is
a preliminary patch to make the individual btrfs inode operations
idmapped mount aware.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c2fd68b6 26-Jul-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: add mapping aware lookup helper

Various filesystems rely on the lookup_one_len() helper to lookup a
single path component relative to a well-known starting point. Allow
such filesystems to support idmapped mounts by adding a version of this
helper to take the idmap into account when calling inode_permission().
This change is a required to let btrfs (and other filesystems) support
idmapped mounts.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e7849e33 10-Aug-2021 Anand Jain <anand.jain@oracle.com>

btrfs: sysfs: document structures and their associated files

Sysfs file has grown big. It takes some time to locate the correct
struct attribute to add new files. Create a table and map the struct
attribute to its sysfs path.

Also, fix the comment about the debug sysfs path. And add the comments
to the attributes instead of attribute group, where sysfs file names are
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e4571b8c 06-Aug-2021 Qu Wenruo <wqu@suse.com>

btrfs: fix NULL pointer dereference when deleting device by invalid id

[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:

# mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
/dev/test/scratch2
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs device remove 3 /mnt/btrfs

Then we have the following kernel NULL pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
btrfs_ioctl+0x18bb/0x3190 [btrfs]
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
? do_user_addr_fault+0x201/0x6a0
? lock_release+0xd2/0x2d0
? __x64_sys_ioctl+0x83/0xb0
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

[CAUSE]
Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().

But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.

In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.

Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.

[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.

Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

63fb5879 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: add asserts on splitting extent_map

We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
we can assume the extent_map is PINNED, not LOGGING, and in the modified
list. Add ASSERT()s to ensure the extent_maps after the split also has the
proper flags set and are in the modified list.

Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0ae79c6f 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: fix block group alloc_offset calculation

alloc_offset is offset from the start of a block group and @offset is
actually an address in logical space. Thus, we need to consider
block_group->start when calculating them.

Fixes: 011b41bffa3d ("btrfs: zoned: advance allocation pointer after tree log node")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ba86dd9f 08-Aug-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: zoned: suppress reclaim error message on EAGAIN

btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
running. The message can fail btrfs/187 and it's unnecessary because we
anyway add it back to the reclaim list.

btrfs_reclaim_bgs_work()
`-> btrfs_relocate_chunk()
`-> btrfs_relocate_block_group()
`-> reloc_chunk_start()
`-> if (fs_info->send_in_progress)
`-> return -EAGAIN

CC: stable@vger.kernel.org # 5.13+
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

77233c2d 09-Aug-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: allow disabling of zone auto reclaim

Automatically reclaiming dirty zones might not always be desired for all
workloads, especially as there are currently still some rough edges with
the relocation code on zoned filesystems.

Allow disabling zone auto reclaim on a per filesystem basis by writing 0
as the threshold value.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1f295373 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: update comment at log_conflicting_inodes()

A comment at log_conflicting_inodes() mentions that we check the inode's
logged_trans field instead of using btrfs_inode_in_log() because the field
last_log_commit is not updated when we log that an inode exists and the
inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
about the full sync flag is not true anymore since commit 9acc8103ab594f
("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
update the comment to not mention that part anymore.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d135a533 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove no longer needed full sync flag check at inode_logged()

Now that we are checking if the inode's logged_trans is 0 to detect the
possibility of the inode having been evicted and reloaded, the test for
the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
tree-log.c:inode_logged(). Its purpose was to detect the possibility
of a previous eviction as well, since when an inode is loaded the full
sync flag is always set on it (and only cleared after the inode is
logged).

So just remove the check and update the comment. The check for the inode's
logged_trans being 0 was added recently by the patch with the subject
"btrfs: eliminate some false positives when checking if inode was logged".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1c167b87 29-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: remove unnecessary NULL check for the new inode during rename exchange

At the very end of btrfs_rename_exchange(), in case an error happened, we
are checking if 'new_inode' is NULL, but that is not needed since during a
rename exchange, unlike regular renames, 'new_inode' can never be NULL,
and if it were, we would have a crashed much earlier when we dereference it
multiple times.

So remove the check because it is not necessary and because it is causing
static checkers to emit a warning. I probably introduced the check by
copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
NULL, in commit 86e8aa0e772cab ("Btrfs: unpin logs if rename exchange
operation fails").

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dce28150 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate backref_ctx on stack in find_extent_clone

Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
on stack. The size is reasonably small.

sizeof(backref_ctx) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c853a578 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate btrfs_ioctl_defrag_range_args on stack

Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_defrag_range_args) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0afb603a 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate btrfs_ioctl_quota_rescan_args on stack

Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_quota_rescan_args) = 64

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

98caf953 27-Jul-2021 Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs: allocate file_ra_state on stack in readahead_cache

Instead of allocating file_ra_state using kmalloc, allocate on stack.
sizeof(struct readahead) = 32 bytes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0ff40a91 29-Jul-2021 Marcos Paulo de Souza <mpdesouza@suse.com>

btrfs: introduce btrfs_search_backwards function

It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.

Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.

The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ea3dc7d2 28-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: print if fsverity support is built in when loading module

As fsverity support depends on a config option, print that at module
load time like we do for similar features.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

70524253 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: verity metadata orphan items

Writing out the verity data is too large of an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

14605409 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: initial fsverity support

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

77eea05e 30-Jun-2021 Boris Burkov <boris@bur.io>

btrfs: add ro compat flags to inodes

Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:

btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted exactly badly).

Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading zeroes
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

efc222f8 27-Jul-2021 Anand Jain <anand.jain@oracle.com>

btrfs: simplify return values in btrfs_check_raid_min_devices

Function btrfs_check_raid_min_devices() returns error code from the enum
btrfs_err_code and it starts from 1. So there is no need to check if ret
is > 0. So drop this check and also drop the local variable ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7361b4ae 28-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: remove the dead comment in writepage_delalloc()

When btrfs_run_delalloc_range() failed, we will error out.

But there is a strange comment mentioning that
btrfs_run_delalloc_range() could have returned value >0 to indicate the
IO has already started.

Commit 40f765805f08 ("Btrfs: split up __extent_writepage to lower stack
usage") introduced the comment, but unfortunately at that time, we were
already using @page_started to indicate that case, and still return 0.

Furthermore, even if that comment was right (which is not), we would
return -EIO if the IO had already started.

By all means the comment is incorrect, just remove the comment along
with the dead check.

Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
make sure we either return 0 or error, no positive return value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b2f78e88 22-Jul-2021 David Sterba <dsterba@suse.com>

btrfs: allow degenerate raid0/raid10

The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.

The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.

Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.

Example output:

# btrfs fi us -T .
Overall:
Device size: 10.00GiB
Device allocated: 1.01GiB
Device unallocated: 8.99GiB
Device missing: 0.00B
Used: 200.61MiB
Free (estimated): 9.79GiB (min: 9.79GiB)
Free (statfs, df): 9.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no

Data Metadata System
Id Path RAID0 single single Unallocated
-- ---------- --------- --------- -------- -----------
1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB
-- ---------- --------- --------- -------- -----------
Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB
Used 200.25MiB 352.00KiB 16.00KiB

# btrfs dev us .
/dev/sda10, ID: 1
Device size: 10.00GiB
Device slack: 0.00B
Data,RAID0/1: 1.00GiB
Metadata,single: 8.00MiB
System,single: 1.00MiB
Unallocated: 8.99GiB

Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bd54f381 27-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: do not pin logs too early during renames

During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.

In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.

So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.

The following script that uses dbench was used for testing:

$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 120 16

umount $MNT

The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
and using a non-debug kernel config (Debian's default config).

The results compare a branch without this patch and without the previous
patch in the series, that has the subject:

"btrfs: eliminate some false positives when checking if inode was logged"

Versus the same branch with these two patches applied.

dbench with 8 clients, results before:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4391359 0.009 249.745
Close 3225882 0.001 3.243
Rename 185953 0.065 240.643
Unlink 886669 0.049 249.906
Deltree 112 2.455 217.433
Mkdir 56 0.002 0.004
Qpathinfo 3980281 0.004 3.109
Qfileinfo 697579 0.001 0.187
Qfsinfo 729780 0.002 2.424
Sfileinfo 357764 0.004 1.415
Find 1538861 0.016 4.863
WriteX 2189666 0.010 3.327
ReadX 6883443 0.002 0.729
LockX 14298 0.002 0.073
UnlockX 14298 0.001 0.042
Flush 307777 2.447 303.663

Throughput 1149.6 MB/sec 8 clients 8 procs max_latency=303.666 ms

dbench with 8 clients, results after:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4269920 0.009 213.532
Close 3136653 0.001 0.690
Rename 180805 0.082 213.858
Unlink 862189 0.050 172.893
Deltree 112 2.998 218.328
Mkdir 56 0.002 0.003
Qpathinfo 3870158 0.004 5.072
Qfileinfo 678375 0.001 0.194
Qfsinfo 709604 0.002 0.485
Sfileinfo 347850 0.004 1.304
Find 1496310 0.017 5.504
WriteX 2129613 0.010 2.882
ReadX 6693066 0.002 1.517
LockX 13902 0.002 0.075
UnlockX 13902 0.001 0.055
Flush 299276 2.511 220.189

Throughput 1187.33 MB/sec 8 clients 8 procs max_latency=220.194 ms

+3.2% throughput, -31.8% max latency

dbench with 16 clients, results before:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5978334 0.028 156.507
Close 4391598 0.001 1.345
Rename 253136 0.241 155.057
Unlink 1207220 0.182 257.344
Deltree 160 6.123 36.277
Mkdir 80 0.003 0.005
Qpathinfo 5418817 0.012 6.867
Qfileinfo 949929 0.001 0.941
Qfsinfo 993560 0.002 1.386
Sfileinfo 486904 0.004 2.829
Find 2095088 0.059 8.164
WriteX 2982319 0.017 9.029
ReadX 9371484 0.002 4.052
LockX 19470 0.002 0.461
UnlockX 19470 0.001 0.990
Flush 418936 2.740 347.902

Throughput 1495.31 MB/sec 16 clients 16 procs max_latency=347.909 ms

dbench with 16 clients, results after:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5711833 0.029 131.240
Close 4195897 0.001 1.732
Rename 241849 0.204 147.831
Unlink 1153341 0.184 231.322
Deltree 160 6.086 30.198
Mkdir 80 0.003 0.021
Qpathinfo 5177011 0.012 7.150
Qfileinfo 907768 0.001 0.793
Qfsinfo 949205 0.002 1.431
Sfileinfo 465317 0.004 2.454
Find 2001541 0.058 7.819
WriteX 2850661 0.017 9.110
ReadX 8952289 0.002 3.991
LockX 18596 0.002 0.655
UnlockX 18596 0.001 0.179
Flush 400342 2.879 293.607

Throughput 1565.73 MB/sec 16 clients 16 procs max_latency=293.611 ms

+4.6% throughput, -16.9% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6e8e777d 27-Jul-2021 Filipe Manana <fdmanana@suse.com>

btrfs: eliminate some false positives when checking if inode was logged

When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:

1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
unnecessarily join a log transaction and do deletion searches in a log
tree that will not find anything. This just adds unnecessary contention
on extent buffer locks;

2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
needed. If the inode was not logged before, we don't need to log it in
LOG_INODE_EXISTS mode.

So just make sure that any false positive only happens when ->logged_trans
has a value of 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

42b5d73b 21-Jul-2021 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: drop unnecessary ASSERT from btrfs_submit_direct()

When on SINGLE block group, btrfs_get_io_geometry() will return "the
size of the block group - the offset of the logical address within the
block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
INT_MAX);".

The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim()
which both take "int" (now u64 due to the previous patch). So to be
precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
clone_len is already capped by bio.bi_iter.bi_size which is unsigned
int. So the ASSERT is not necessary.

Drop the ASSERT and properly compare submit_len and geom.len in u64.
Then, let the implicit casting to convert it to u64.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

21dda654 21-Jul-2021 Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

btrfs: fix argument type of btrfs_bio_clone_partial()

The offset and can never be negative use unsigned int instead of int
type for them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5662c967 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

fs: kill sync_inode

Now that all users of sync_inode() have been deleted, remove
sync_inode().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

25d23cd0 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

9p: migrate from sync_inode to filemap_fdatawrite_wbc

We're going to remove sync_inode, so migrate to filemap_fdatawrite_wbc
instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b3776305 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking

sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure. If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether. What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e1646070 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: wait on async extents when flushing delalloc

I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem. When we switched to the per-inode in
38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set. This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc. The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all. This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers. This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc. This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2ca7 ("btrfs: shrink delalloc pages instead of full inodes")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

03fe78cc 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc

We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fcdef39c 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: enable a tracepoint when we fail tickets

When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes. This adds the tracpoint
so you can easily get that information.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ac98141d 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

btrfs: wake up async_delalloc_pages waiters after submit

We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing. The reason for this is we need to see any ordered extents
that were created by our delalloc flushing. However we're waking up
before we do the submit work, which is before we create the ordered
extents. This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely. Fix this by waking us up after we've done the work to
create ordered extents.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

963e4db8 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: unify regular and subpage error paths in __extent_writepage()

[BUG]
When running btrfs/160 in a loop for subpage with experimental
compression support, it has a high chance to crash (~20%):

BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:238!
Internal error: Oops - BUG: 0 [#1] SMP
pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
Call trace:
__btrfs_add_ordered_extent+0x550/0x670 [btrfs]
btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
run_delalloc_nocow+0x81c/0x8fc [btrfs]
btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
writepage_delalloc+0xc0/0x1ac [btrfs]
__extent_writepage+0xf4/0x370 [btrfs]
extent_write_cache_pages+0x288/0x4f4 [btrfs]
extent_writepages+0x58/0xe0 [btrfs]
btrfs_writepages+0x1c/0x30 [btrfs]
do_writepages+0x60/0x110
__filemap_fdatawrite_range+0x108/0x170
filemap_fdatawrite_range+0x20/0x30
btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
__btrfs_write_out_cache+0x34c/0x480 [btrfs]
btrfs_write_out_cache+0x144/0x220 [btrfs]
btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
btrfs_sync_fs+0x64/0x1cc [btrfs]
sync_fs_one_sb+0x3c/0x50
iterate_supers+0xcc/0x1d4
ksys_sync+0x6c/0xd0
__arm64_sys_sync+0x1c/0x30
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0x4c/0xd4
do_el0_svc+0x30/0x9c
el0_svc+0x2c/0x54
el0_sync_handler+0x1a8/0x1b0
el0_sync+0x198/0x1c0
---[ end trace 336f67369ae6e0af ]---

[CAUSE]
For subpage case, we can have multiple sectors inside a page, this makes
it possible for __extent_writepage() to have part of its page submitted
before returning.

In btrfs/160, we are using dm-dust to emulate write error, this means
for certain pages, we could have everything running fine, but at the end
of __extent_writepage(), one of the submitted bios fails due to dm-dust.

Then the page is marked Error, and we change @ret from 0 to -EIO.

This makes the caller extent_write_cache_pages() to error out, without
submitting the remaining pages.

Furthermore, since we're erroring out for free space cache, it doesn't
really care about the error and will update the inode and retry the
writeback.

Then we re-run the delalloc range, and will try to insert the same
delalloc range while previous delalloc range is still hanging there,
triggering the above error.

[FIX]
The proper fix is to handle errors from __extent_writepage() properly,
by ending the remaining ordered extent.

But that fix needs the following changes:

- Know at exactly which sector the error happened
Currently __extent_writepage_io() works for the full page, can't
return at which sector we hit the error.

- Grab the ordered extent covering the failed sector

As a hotfix for subpage case, here we unify the error paths in
__extent_writepage().

In fact, the "if (PageError(page))" branch never get executed if @ret is
still 0 for non-subpage cases.

As for non-subpage case, we never submit current page in
__extent_writepage(), but only add current page into bio.
The bio can only get submitted in next page.

Thus we never get PageError() set due to IO failure, thus when we hit
the branch, @ret is never 0.

By simply removing that @ret assignment, we let subpage case ignore the
IO failure, thus only error out for fatal errors just like regular
sectorsize.

So that IO error won't be treated as fatal error not trigger the hanging
OE problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

95ea0486 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: allow read-write for 4K sectorsize on 64K page size systems

Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limitations though:

- For now, subpage RW mount is still considered experimental
Thus that mount warning will still be there.

- No compression support
There are still quite some PAGE_SIZE hard coded and quite some call
sites use extent_clear_unlock_delalloc() to unlock locked_page.
This will screw up subpage helpers.

Now for subpage RW mount, no matter what mount option or inode attr is
set, all writes will not be compressed. Although reading compressed
data has no problem.

- No defrag for subpage case
The defrag support for subpage case will come in later patches, which
will also rework the defrag workflow.

- No inline extent will be created
This is mostly due to the fact that filemap_fdatawrite_range() will
trigger more write than the range specified.
In fallocate calls, this behavior can make us to writeback which can
be inlined, before we enlarge the i_size.

This is a very special corner case, and even current btrfs check won't
report error on such inline extent + regular extent.
But considering how much effort has been put to prevent such inline +
regular, I'd prefer to cut off inline extent completely until we have
a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

9d9ea1e6 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix relocation potentially overwriting last page data

[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

mkfs.btrfs -f -s 4k $dev
mount $dev -o nospace_cache $mnt
$fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
sync
btrfs balance start -d $mnt
btrfs scrub start -B $mnt

Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.

[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
refs 2 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 1339392 count 1
extent data backref root FS_TREE objectid 259 offset 647168 count 1
item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
block group used 102400 chunk_objectid 256 flags DATA
item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

0 32K 64K 96K 100K 104K
|<------ Extent A ----->| |<- Ext B ->|

Then when we first try to relocate extent A, we setup the data reloc
inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update i_size to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [i_size, PAGE_END]
in prealloc_file_extent_cluster().

So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.

Reading such page can cause a deadlock, as we normally expect a dirty
page to be fully uptodate.

Thus here we flush and wait the data reloc inode before doing the hacked
invalidating. This won't cause extra overhead, as we're going to
writeback the data later anyway.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e3c62324 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix false alert when relocating partial preallocated data extents

[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

BTRFS info (device dm-3): balance: start -d
BTRFS info (device dm-3): relocating block group 13631488 flags data
BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

mkfs.btrfs -f -s 4k $dev
mount $dev -o nospace_cache $mnt
xfs_io -f -c "falloc 0 8k" $mnt/file
xfs_io -f -c "pwrite 0 4k" $mnt/file
btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
have EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

7c11d0ae 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix a potential use-after-free in writeback helper

[BUG]
There is a possible use-after-free bug when running generic/095.

BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
Faulting instruction address: 0xc000000000283654
c000000000283078 do_raw_spin_unlock+0x88/0x230
c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
c0000000009e0458 end_bio_extent_writepage+0x158/0x270
c000000000b6fd14 bio_endio+0x254/0x270
c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
c000000000b6fd14 bio_endio+0x254/0x270
c000000000b781fc blk_update_request+0x46c/0x670
c000000000b8b394 blk_mq_end_request+0x34/0x1d0
c000000000d82d1c lo_complete_rq+0x11c/0x140
c000000000b880a4 blk_complete_reqs+0x84/0xb0
c0000000012b2ca4 __do_softirq+0x334/0x680
c0000000001dd878 irq_exit+0x148/0x1d0
c000000000016f4c do_IRQ+0x20c/0x240
c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

Thread 1 | Thread 2
--------------------------------+------------------------------------
end_bio_extent_writepage() | btrfs_releasepage()
|- spin_lock_irqsave() | |
|- end_page_writeback() | |
| | |- if (PageWriteback() ||...)
| | |- clear_page_extent_mapped()
| | |- kfree(subpage);
|- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e0467866 26-Jul-2021 Qu Wenruo <wqu@suse.com>

btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()

[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

assertion failed: PagePrivate(page) && page->private
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3403!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
Hardware name: Khadas VIM3 (DT)
Call trace:
assertfail.constprop.0+0x28/0x2c [btrfs]
btrfs_subpage_assert+0x80/0xa0 [btrfs]
btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
btrfs_dirty_pages+0x160/0x270 [btrfs]
btrfs_buffered_write+0x444/0x630 [btrfs]
btrfs_direct_write+0x1cc/0x2d0 [btrfs]
btrfs_file_write_iter+0xc0/0x160 [btrfs]
new_sync_write+0xe8/0x180
vfs_write+0x1b4/0x210
ksys_pw