Searched +hist:2 +hist:f221d6f (Results 1 - 25 of 55) sorted by last modified time

123

/linux-master/fs/9p/
H A Dvfs_inode_dotl.cdiff 2dc92e59 Fri Jan 05 10:17:35 MST 2024 Eric Van Hensbergen <ericvh@kernel.org> fs/9p: Eliminate redundant non-cache path in mknod

Like symlink, mknod had a seperate path with different inode
allocation -- but this seems unnecessary, so eliminating this path.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
diff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff cf7c33d3 Wed May 03 01:49:29 MDT 2023 Dominique Martinet <asmadeus@codewreck.org> 9p: remove dead stores (variable set again without being read)

The 9p code for some reason used to initialize variables outside of the
declaration, e.g. instead of just initializing the variable like this:

int retval = 0

We would be doing this:

int retval;
retval = 0;

This is perfectly fine and the compiler will just optimize dead stores
anyway, but scan-build seems to think this is a problem and there are
many of these warnings making the output of scan-build full of such
warnings:
fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores]
retval = 0;
^ ~

I have no strong opinion here, but if we want to regularly run
scan-build we should fix these just to silence the messages.

I've confirmed these all are indeed ok to remove.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
diff d9bc0d11 Wed Dec 07 19:40:37 MST 2022 Eric Van Hensbergen <ericvh@kernel.org> fs/9p: Consolidate file operations and add readahead and writeback

We had 3 different sets of file operations across 2 different protocol
variants differentiated by cache which really only changed 3
functions. But the real problem is that certain file modes, mount
options, and other factors weren't being considered when we
decided whether or not to use caches.

This consolidates all the operations and switches
to conditionals within a common set to decide whether or not
to do different aspects of caching.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
diff beca774f Sun Jun 12 01:00:05 MDT 2022 Dominique Martinet <asmadeus@codewreck.org> 9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl

We need to release directory fid if we fail halfway through open

This fixes fid leaking with xfstests generic 531

Link: https://lkml.kernel.org/r/20220612085330.1451496-2-asmadeus@codewreck.org
Fixes: 6636b6dcc3db ("9p: add refcount to p9_fid struct")
Cc: stable@vger.kernel.org
Reported-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff 24e42e32 Wed Nov 18 02:06:42 MST 2020 David Howells <dhowells@redhat.com> 9p: Use fscache indexing rewrite and reenable caching

Change the 9p filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in 9p.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole.

(2) The session cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). That takes three parameters: a string
representing the "volume" in the index, a string naming the cache to
use (or NULL) and a u64 that conveys coherency metadata for the
volume.

For 9p, I've made it render the volume name string as:

"9p,<devname>,<cachetag>"

where the cachetag is replaced by the aname if it wasn't supplied.

This probably needs rethinking a bit as the aname can have slashes in
it. It might be better to hash the cachetag and use the hash or I
could substitute commas for the slashes or something.

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before.

(4) The functions to set/reset/flush cookies are removed and
fscache_use_cookie() and fscache_unuse_cookie() are used instead.

fscache_use_cookie() is passed a flag to indicate if the cookie is
opened for writing. fscache_unuse_cookie() is passed updates for the
metadata if we changed it (ie. if the file was opened for writing).

These are called when the file is opened or closed.

(5) wait_on_page_bit[_killable]() is replaced with the specific wait
functions for the bits waited upon.

(6) I've got rid of some of the 9p-specific cache helper functions and
called things like fscache_relinquish_cookie() directly as they'll
optimise away if v9fs_inode_cookie() returns an unconditional NULL
(which will be the case if CONFIG_9P_FSCACHE=n).

(7) v9fs_vfs_setattr() is made to call fscache_resize() to change the size
of the cache object.

Notes:

(A) We should call fscache_invalidate() if we detect that the server's
copy of a file got changed by a third party, but I don't know where to
do that. We don't need to do that when allocating the cookie as we
get a check-and-invalidate when we initially bind to the cache object.

(B) The copy-to-cache-on-writeback side of things will be handled in
separate patch.

Changes
=======
ver #3:
- Canonicalise the cookie key and coherency data to make them
endianness-independent.

ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819664645.215744.1555314582005286846.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906975017.143852.3459573173204394039.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967178512.1823006.17377493641569138183.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021573143.640689.3977487095697717967.stgit@warthog.procyon.org.uk/ # v4
diff 24e42e32 Wed Nov 18 02:06:42 MST 2020 David Howells <dhowells@redhat.com> 9p: Use fscache indexing rewrite and reenable caching

Change the 9p filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in 9p.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole.

(2) The session cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). That takes three parameters: a string
representing the "volume" in the index, a string naming the cache to
use (or NULL) and a u64 that conveys coherency metadata for the
volume.

For 9p, I've made it render the volume name string as:

"9p,<devname>,<cachetag>"

where the cachetag is replaced by the aname if it wasn't supplied.

This probably needs rethinking a bit as the aname can have slashes in
it. It might be better to hash the cachetag and use the hash or I
could substitute commas for the slashes or something.

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before.

(4) The functions to set/reset/flush cookies are removed and
fscache_use_cookie() and fscache_unuse_cookie() are used instead.

fscache_use_cookie() is passed a flag to indicate if the cookie is
opened for writing. fscache_unuse_cookie() is passed updates for the
metadata if we changed it (ie. if the file was opened for writing).

These are called when the file is opened or closed.

(5) wait_on_page_bit[_killable]() is replaced with the specific wait
functions for the bits waited upon.

(6) I've got rid of some of the 9p-specific cache helper functions and
called things like fscache_relinquish_cookie() directly as they'll
optimise away if v9fs_inode_cookie() returns an unconditional NULL
(which will be the case if CONFIG_9P_FSCACHE=n).

(7) v9fs_vfs_setattr() is made to call fscache_resize() to change the size
of the cache object.

Notes:

(A) We should call fscache_invalidate() if we detect that the server's
copy of a file got changed by a third party, but I don't know where to
do that. We don't need to do that when allocating the cookie as we
get a check-and-invalidate when we initially bind to the cache object.

(B) The copy-to-cache-on-writeback side of things will be handled in
separate patch.

Changes
=======
ver #3:
- Canonicalise the cookie key and coherency data to make them
endianness-independent.

ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819664645.215744.1555314582005286846.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906975017.143852.3459573173204394039.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967178512.1823006.17377493641569138183.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021573143.640689.3977487095697717967.stgit@warthog.procyon.org.uk/ # v4
diff 024b7d6a Tue Nov 02 07:12:01 MDT 2021 Dominique Martinet <asmadeus@codewreck.org> 9p: fix file headers

- add missing SPDX-License-Identifier
- remove (sometimes incorrect) file name from file header

Link: http://lkml.kernel.org/r/20211102134608.1588018-2-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff bc868036 Mon Oct 04 15:07:22 MDT 2021 David Howells <dhowells@redhat.com> 9p: Fix a bunch of kerneldoc warnings shown up by W=1

Fix a bunch of kerneldoc warnings shown up by W=1 in the 9p filesystem:

(1) Add/remove/fix kerneldoc parameters descriptions.

(2) Move __add_fid() from between v9fs_fid_add() and its comment.

(3) 9p's caches_show() doesn't really make sense as an API function, so
remove the kerneldoc annotation. It's also not prefixed with 'v9fs_'.
Also remove the kerneldoc markers from the 9p fscache wrappers.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Mauro Carvalho Chehab <mchehab@kernel.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
cc: linux-doc@vger.kernel.org
Link: https://lore.kernel.org/r/163214005516.2945267.7000234432243167892.stgit@warthog.procyon.org.uk/ # rfc v1
Link: https://lore.kernel.org/r/163281899704.2790286.9177774252843775348.stgit@warthog.procyon.org.uk/ # rfc v2
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
H A Dvfs_inode.cdiff 100ccd18 Fri Nov 24 06:39:02 MST 2023 David Howells <dhowells@redhat.com> netfs: Optimise away reads above the point at which there can be no data

Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.

Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:

(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.

(2) On local inode creation, we set zero_point to 0.

(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.

(4) On local truncation up, we don't change zero_point.

(5) On local modification, we don't change zero_point.

(6) On remote invalidation, we set zero_point to the new i_size.

(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.

(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.

(9) If a direct I/O write is made, set zero_point above that.

Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.

The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
diff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff cf7c33d3 Wed May 03 01:49:29 MDT 2023 Dominique Martinet <asmadeus@codewreck.org> 9p: remove dead stores (variable set again without being read)

The 9p code for some reason used to initialize variables outside of the
declaration, e.g. instead of just initializing the variable like this:

int retval = 0

We would be doing this:

int retval;
retval = 0;

This is perfectly fine and the compiler will just optimize dead stores
anyway, but scan-build seems to think this is a problem and there are
many of these warnings making the output of scan-build full of such
warnings:
fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores]
retval = 0;
^ ~

I have no strong opinion here, but if we want to regularly run
scan-build we should fix these just to silence the messages.

I've confirmed these all are indeed ok to remove.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
diff d9bc0d11 Wed Dec 07 19:40:37 MST 2022 Eric Van Hensbergen <ericvh@kernel.org> fs/9p: Consolidate file operations and add readahead and writeback

We had 3 different sets of file operations across 2 different protocol
variants differentiated by cache which really only changed 3
functions. But the real problem is that certain file modes, mount
options, and other factors weren't being considered when we
decided whether or not to use caches.

This consolidates all the operations and switches
to conditionals within a common set to decide whether or not
to do different aspects of caching.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 874c8ca1 Thu Jun 09 14:46:04 MDT 2022 David Howells <dhowells@redhat.com> netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/fs/btrfs/
H A Dinode.cdiff 74e97958 Thu Mar 21 11:02:04 MDT 2024 Boris Burkov <boris@bur.io> btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations

Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:

1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes: e85fde5162bf ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 2b712e3b Thu Jan 25 09:44:47 MST 2024 David Sterba <dsterba@suse.com> btrfs: remove unused included headers

With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 1bd96c92 Thu Feb 01 17:09:22 MST 2024 Filipe Manana <fdmanana@suse.com> btrfs: reject encoded write if inode has nodatasum flag set

Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.

For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
root 256 inode 257 errors 1040, bad file extent, some csum missing
root 256 inode 258 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
(...)

So reject encoded writes if the target inode has NODATASUM set.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 3324d054 Thu Jan 04 12:48:47 MST 2024 Omar Sandoval <osandov@fb.com> btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted

Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:

Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |

This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.

create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.

CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 00deaf04 Mon Dec 04 09:20:29 MST 2023 Filipe Manana <fdmanana@suse.com> btrfs: log messages at unpin_extent_range() during unexpected cases

At unpin_extent_range() we trigger a WARN_ON() when we don't find an
extent map or we find one with a start offset not matching the start
offset of the target range. This however isn't very useful for debugging
because:

1) We don't know which condition was triggered, as they are both in the
same WARN_ON() call;

2) We don't know which inode was affected, from which root, for which
range, what's the start offset of the extent map, and so on.

So trigger a separate warning for each case and log a message for each
case providing information about the inode, its root, the target range,
the generation and the start offset of the extent map we found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 7dc66abb Tue Nov 21 06:38:38 MST 2023 Filipe Manana <fdmanana@suse.com> btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.

Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.

We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 7dc66abb Tue Nov 21 06:38:38 MST 2023 Filipe Manana <fdmanana@suse.com> btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.

Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.

We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 7dc66abb Tue Nov 21 06:38:38 MST 2023 Filipe Manana <fdmanana@suse.com> btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.

Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.

We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 7dc66abb Tue Nov 21 06:38:38 MST 2023 Filipe Manana <fdmanana@suse.com> btrfs: use a dedicated data structure for chunk maps

Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.

Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.

We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff 776a838f Tue Oct 17 02:00:31 MDT 2023 Naohiro Aota <naohiro.aota@wdc.com> btrfs: zoned: wait for data BG to be finished on direct IO allocation

Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.

$ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
--rw=write --ioengine=libaio --iodepth=128 --direct=1

fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
...

This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.

btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.

This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
/linux-master/mm/
H A Dshmem.cdiff a4af51ce Tue Feb 06 19:56:15 MST 2024 Kent Overstreet <kent.overstreet@linux.dev> fs: super_set_uuid()

Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.

And add a helper super_set_uuid(), for setting nonstandard length uuids.

Helper is now required for the new FS_IOC_GETUUID ioctl; if
super_set_uuid() hasn't been called, the ioctl won't be supported.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff beb98686 Fri Sep 29 21:42:45 MDT 2023 Hugh Dickins <hughd@google.com> shmem,percpu_counter: add _limited_add(fbc, limit, amount)

Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit. Imagine all the other CPUs fallocating tmpfs
huge pages to the limit, in between this CPU's compare and its add.

I have not seen reports of that happening; but tmpfs's recent addition of
dquot_alloc_block_nodirty() in between the compare and the add makes it
even more likely, and I'd be uncomfortable to leave it unfixed.

Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.

I believe this implementation is correct, and slightly more efficient than
the combination of compare and add (taking the lock once rather than twice
when nearing full - the last 128MiB of a tmpfs volume on a machine with
128 CPUs and 4KiB pages); but it does beg for a better design - when
nearing full, there is no new batching, but the costly percpu counter sum
across CPUs still has to be done, while locked.

Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well
as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs? But the times when it matters would be vanishingly rare.

Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 3022fd7a Fri Sep 29 21:32:40 MDT 2023 Hugh Dickins <hughd@google.com> shmem: _add_to_page_cache() before shmem_inode_acct_blocks()

There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
EFAULT) even though all allocations were within the permitted size.

This was a problem since early days, but magnified and complicated by the
addition of huge pages. We have often worked around it by adding some
slop to the tmpfs size, but it's hard to say how much is needed, and some
users prefer not to do that e.g. keeping sparse files in a tightly
tailored tmpfs helps to prevent accidental writing to holes.

This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).

Concurrent tasks allocating at the same position could deplete the size
allowance and fail. Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.

1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.

The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.

Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
splits twice (once huge, once nonhuge) is not exactly equivalent to trying
5 splits (and giving up early on huge), but let's keep it simple unless
more complication proves necessary.

Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock. Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.

Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 3022fd7a Fri Sep 29 21:32:40 MDT 2023 Hugh Dickins <hughd@google.com> shmem: _add_to_page_cache() before shmem_inode_acct_blocks()

There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
EFAULT) even though all allocations were within the permitted size.

This was a problem since early days, but magnified and complicated by the
addition of huge pages. We have often worked around it by adding some
slop to the tmpfs size, but it's hard to say how much is needed, and some
users prefer not to do that e.g. keeping sparse files in a tightly
tailored tmpfs helps to prevent accidental writing to holes.

This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).

Concurrent tasks allocating at the same position could deplete the size
allowance and fail. Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.

1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.

The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.

Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
splits twice (once huge, once nonhuge) is not exactly equivalent to trying
5 splits (and giving up early on huge), but let's keep it simple unless
more complication proves necessary.

Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock. Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.

Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 3022fd7a Fri Sep 29 21:32:40 MDT 2023 Hugh Dickins <hughd@google.com> shmem: _add_to_page_cache() before shmem_inode_acct_blocks()

There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
EFAULT) even though all allocations were within the permitted size.

This was a problem since early days, but magnified and complicated by the
addition of huge pages. We have often worked around it by adding some
slop to the tmpfs size, but it's hard to say how much is needed, and some
users prefer not to do that e.g. keeping sparse files in a tightly
tailored tmpfs helps to prevent accidental writing to holes.

This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).

Concurrent tasks allocating at the same position could deplete the size
allowance and fail. Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.

1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.

The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.

Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
splits twice (once huge, once nonhuge) is not exactly equivalent to trying
5 splits (and giving up early on huge), but let's keep it simple unless
more complication proves necessary.

Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock. Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.

Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2f502860 Fri Sep 29 23:00:32 MDT 2023 Wedson Almeida Filho <walmeida@microsoft.com> shmem: move shmem_xattr_handlers to .rodata

This makes it harder for accidental or malicious changes to
shmem_xattr_handlers at runtime.

Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-29-wedsonaf@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff af19487f Fri Jul 07 15:55:33 MDT 2023 Axel Rasmussen <axelrasmussen@google.com> mm: make PTE_MARKER_SWAPIN_ERROR more general

Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
v4.

This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.


This patch (of 8):

Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make some various preparations for that:

First, rename it to just PTE_MARKER_POISONED. The "SWAPIN" can be
confusing since we're going to re-use it for something not really related
to swap. This can be particularly confusing for things like hugetlbfs,
which doesn't support swap whatsoever. Also rename some various helper
functions.

Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to re-use it, we want it to go ahead and copy it
just like non-hugetlbfs memory does today. Since the code to do this is
more complicated now, pull it out into a helper which can be re-used in
both places. While we're at it, also make it slightly more explicit in
its handling of e.g. uffd wp markers.

For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
error entry, return VM_FAULT_HWPOISON. For most cases this change doesn't
matter, e.g. a userspace program would receive a SIGBUS either way. But
for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
box, instead of giving a SIGBUS to the hypervisor and requiring it to
somehow inject an MCE.

Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today
because the lack of swap support means we'll never end up with such a PTE
anyway, but this behavior will be needed once such entries *can* show up
via UFFDIO_POISON.

Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2daf18a7 Tue Aug 08 22:33:56 MDT 2023 Hugh Dickins <hughd@google.com> tmpfs,xattr: enable limited user extended attributes

Enable "user." extended attributes on tmpfs, limiting them by tracking
the space they occupy, and deducting that space from the limited ispace
(unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited).

tmpfs inodes and simple xattrs are both unswappable, and have to be in
lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate
for xattrs, without any need for a further mount option.

Add simple_xattr_space() to give approximate but deterministic estimate
of the space taken up by each xattr: with simple_xattrs_free() outputting
the space freed if required (but kernfs and even some tmpfs usages do not
require that, so don't waste time on strlen'ing if not needed).

Security and trusted xattrs were already supported: for consistency and
simplicity, account them from the same pool; though there's a small risk
that a tmpfs with enough space before would now be considered too small.

When extended attributes are used, "df -i" does show more IUsed and less
IFree than can be explained by the inodes: document that (manpage later).

xfstests tests/generic which were not run on tmpfs before but now pass:
020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728
with no new failures.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2daf18a7 Tue Aug 08 22:33:56 MDT 2023 Hugh Dickins <hughd@google.com> tmpfs,xattr: enable limited user extended attributes

Enable "user." extended attributes on tmpfs, limiting them by tracking
the space they occupy, and deducting that space from the limited ispace
(unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited).

tmpfs inodes and simple xattrs are both unswappable, and have to be in
lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate
for xattrs, without any need for a further mount option.

Add simple_xattr_space() to give approximate but deterministic estimate
of the space taken up by each xattr: with simple_xattrs_free() outputting
the space freed if required (but kernfs and even some tmpfs usages do not
require that, so don't waste time on strlen'ing if not needed).

Security and trusted xattrs were already supported: for consistency and
simplicity, account them from the same pool; though there's a small risk
that a tmpfs with enough space before would now be considered too small.

When extended attributes are used, "df -i" does show more IUsed and less
IFree than can be explained by the inodes: document that (manpage later).

xfstests tests/generic which were not run on tmpfs before but now pass:
020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728
with no new failures.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5de75970 Tue Aug 08 22:30:59 MDT 2023 Hugh Dickins <hughd@google.com> xattr: simple_xattr_set() return old_xattr to be freed

tmpfs wants to support limited user extended attributes, but kernfs
(or cgroupfs, the only kernfs with KERNFS_ROOT_SUPPORT_USER_XATTR)
already supports user extended attributes through simple xattrs: but
limited by a policy (128KiB per inode) too liberal to be used on tmpfs.

To allow a different limiting policy for tmpfs, without affecting the
policy for kernfs, change simple_xattr_set() to return the replaced or
removed xattr (if any), leaving the caller to update their accounting
then free the xattr (by simple_xattr_free(), renamed from the static
free_simple_xattr()).

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <158c6585-2aa7-d4aa-90ff-f7c3f8fe407c@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
/linux-master/fs/fuse/
H A Ddir.cdiff 2d09ab22 Fri Jan 05 08:21:27 MST 2024 Alexander Mikhalitsyn <alexander@mihalicyn.com> fuse: fix typo for fuse_permission comment

Found by chance while working on support for idmapped mounts in fuse.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 972f4c46 Wed Aug 09 16:45:05 MDT 2023 Miklos Szeredi <mszeredi@redhat.com> fuse: cache btime

Not all inode attributes are supported by all filesystems, but for the
basic stats (which are returned by stat(2) and friends) all of them will
have some value, even if that doesn't reflect a real attribute of the file.

Btime is different, in that filesystems are free to report or not report a
value in statx. If the value is available, then STATX_BTIME bit is set in
stx_mask.

When caching the value of btime, remember the availability of the attribute
as well as the value (if available). This is done by using the
FUSE_I_BTIME bit in fuse_inode->state to indicate availability, while using
fuse_inode->inval_mask & STATX_BTIME to indicate the state of the cache
itself (i.e. set if cache is invalid, and cleared if cache is valid).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff d3045530 Wed Aug 09 16:45:05 MDT 2023 Miklos Szeredi <mszeredi@redhat.com> fuse: implement statx

Allow querying btime. When btime is requested in mask, then FUSE_STATX
request is sent. Otherwise keep using FUSE_GETATTR.

The userspace interface for statx matches that of the statx(2) API.
However there are limitations on how this interface is used:

- returned basic stats and btime are used, stx_attributes, etc. are
ignored

- always query basic stats and btime, regardless of what was requested

- requested sync type is ignored, the default is passed to the server

- if server returns with some attributes missing from the result_mask,
then no attributes will be cached

- btime is not cached yet (next patch will fix that)

For new inodes initialize fi->inval_mask to "all invalid", instead of "all
valid" as previously. Also only clear basic stats from inval_mask when
caching attributes. This will result in the caching logic not thinking
that btime is cached.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff facd6105 Thu Jan 19 16:55:04 MST 2023 Christian Brauner <brauner@kernel.org> fuse: fixes after adapting to new posix acl api

This cycle we ported all filesystems to the new posix acl api. While
looking at further simplifications in this area to remove the last
remnants of the generic dummy posix acl handlers we realized that we
regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use
of posix acls.

With the change to a dedicated posix acl api interacting with posix acls
doesn't go through the old xattr codepaths anymore and instead only
relies the get acl and set acl inode operations.

Before this change fuse daemons that don't set FUSE_POSIX_ACL were able
to get and set posix acl albeit with two caveats. First, that posix acls
aren't cached. And second, that they aren't used for permission checking
in the vfs.

We regressed that use-case as we currently refuse to retrieve any posix
acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons
would see a change in behavior.

We can restore the old behavior in multiple ways. We could change the
new posix acl api and look for a dedicated xattr handler and if we find
one prefer that over the dedicated posix acl api. That would break the
consistency of the new posix acl api so we would very much prefer not to
do that.

We could introduce a new ACL_*_CACHE sentinel that would instruct the
vfs permission checking codepath to not call into the filesystem and
ignore acls.

But a more straightforward fix for v6.2 is to do the same thing that
Overlayfs does and give fuse a separate get acl method for permission
checking. Overlayfs uses this to express different needs for vfs
permission lookup and acl based retrieval via the regular system call
path as well. Let fuse do the same for now. This way fuse can continue
to refuse to retrieve posix acls for daemons that don't set
FUSE_POSXI_ACL for permission checking while allowing a fuse server to
retrieve it via the usual system calls.

In the future, we could extend the get acl inode operation to not just
pass a simple boolean to indicate rcu lookup but instead make it a flag
argument. Then in addition to passing the information that this is an
rcu lookup to the filesystem we could also introduce a flag that tells
the filesystem that this is a request from the vfs to use these acls for
permission checking. Then fuse could refuse the get acl request for
permission checking when the daemon doesn't have FUSE_POSIX_ACL set in
the same get acl method. This would also help Overlayfs and allow us to
remove the second method for it as well.

But since that change is more invasive as we need to update the get acl
inode operation for multiple filesystems we should not do this as a fix
for v6.2. Instead we will do this for the v6.3 merge window.

Fwiw, since posix acls are now always correctly translated in the new
posix acl api we could also allow them to be used for daemons without
FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral
change and again if dones should be done for v6.3. For now, let's just
restore the original behavior.

A nice side-effect of this change is that for fuse daemons with and
without FUSE_POSIX_ACL the same code is used for posix acls in a
backwards compatible way. This also means we can remove the legacy xattr
handlers completely. We've also added comments to explain the expected
behavior for daemons without FUSE_POSIX_ACL into the code.

Fixes: 318e66856dde ("xattr: use posix acl api")
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
diff facd6105 Thu Jan 19 16:55:04 MST 2023 Christian Brauner <brauner@kernel.org> fuse: fixes after adapting to new posix acl api

This cycle we ported all filesystems to the new posix acl api. While
looking at further simplifications in this area to remove the last
remnants of the generic dummy posix acl handlers we realized that we
regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use
of posix acls.

With the change to a dedicated posix acl api interacting with posix acls
doesn't go through the old xattr codepaths anymore and instead only
relies the get acl and set acl inode operations.

Before this change fuse daemons that don't set FUSE_POSIX_ACL were able
to get and set posix acl albeit with two caveats. First, that posix acls
aren't cached. And second, that they aren't used for permission checking
in the vfs.

We regressed that use-case as we currently refuse to retrieve any posix
acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons
would see a change in behavior.

We can restore the old behavior in multiple ways. We could change the
new posix acl api and look for a dedicated xattr handler and if we find
one prefer that over the dedicated posix acl api. That would break the
consistency of the new posix acl api so we would very much prefer not to
do that.

We could introduce a new ACL_*_CACHE sentinel that would instruct the
vfs permission checking codepath to not call into the filesystem and
ignore acls.

But a more straightforward fix for v6.2 is to do the same thing that
Overlayfs does and give fuse a separate get acl method for permission
checking. Overlayfs uses this to express different needs for vfs
permission lookup and acl based retrieval via the regular system call
path as well. Let fuse do the same for now. This way fuse can continue
to refuse to retrieve posix acls for daemons that don't set
FUSE_POSXI_ACL for permission checking while allowing a fuse server to
retrieve it via the usual system calls.

In the future, we could extend the get acl inode operation to not just
pass a simple boolean to indicate rcu lookup but instead make it a flag
argument. Then in addition to passing the information that this is an
rcu lookup to the filesystem we could also introduce a flag that tells
the filesystem that this is a request from the vfs to use these acls for
permission checking. Then fuse could refuse the get acl request for
permission checking when the daemon doesn't have FUSE_POSIX_ACL set in
the same get acl method. This would also help Overlayfs and allow us to
remove the second method for it as well.

But since that change is more invasive as we need to update the get acl
inode operation for multiple filesystems we should not do this as a fix
for v6.2. Instead we will do this for the v6.3 merge window.

Fwiw, since posix acls are now always correctly translated in the new
posix acl api we could also allow them to be used for daemons without
FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral
change and again if dones should be done for v6.3. For now, let's just
restore the original behavior.

A nice side-effect of this change is that for fuse daemons with and
without FUSE_POSIX_ACL the same code is used for posix acls in a
backwards compatible way. This also means we can remove the legacy xattr
handlers completely. We've also added comments to explain the expected
behavior for daemons without FUSE_POSIX_ACL into the code.

Fixes: 318e66856dde ("xattr: use posix acl api")
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
diff 2fdbb8dd Fri Apr 22 07:48:53 MDT 2022 Miklos Szeredi <mszeredi@redhat.com> fuse: fix deadlock between atomic O_TRUNC and page invalidation

fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which will cause the xfstests
generic/464 testcase hung in our virtio-fs test environment.

For example, consider two processes concurrently open one same file, one
with O_TRUNC and another without O_TRUNC. The deadlock case is described
below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying
to lock a page (acquiring B), open() could have held the page lock
(acquired B), and waiting on the page writeback (acquiring A). This would
lead to deadlocks.

open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
inode_lock [C acquire]
fuse_set_nowrite [A acquire]

fuse_finish_open
truncate_pagecache
lock_page [B acquire]
truncate_inode_page
unlock_page [B release]

fuse_release_nowrite [A release]
inode_unlock [C release]
----------------------------------------------------------------

open()
----------------------------------------------------------------
fuse_open_common
fuse_finish_open
invalidate_inode_pages2
lock_page [B acquire]
fuse_launder_page
fuse_wait_on_page_writeback [A acquire & release]
unlock_page [B release]
----------------------------------------------------------------

Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).

Fix by moving the truncate_pagecache() call outside the nowrite protected
region. The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server. Write syscalls racing with
page cache truncation still get the inode lock protection.

This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().

Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2bf06b8e Wed Feb 09 13:21:56 MST 2022 Matthew Wilcox (Oracle) <willy@infradead.org> fuse: Convert from launder_page to launder_folio

Straightforward conversion although the helper functions still assume
a single page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
diff 5c791fe1 Fri Oct 22 09:03:01 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call

- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 5c791fe1 Fri Oct 22 09:03:01 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call

- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 5c791fe1 Fri Oct 22 09:03:01 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call

- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 5c791fe1 Fri Oct 22 09:03:01 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call

- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 5c791fe1 Fri Oct 22 09:03:01 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> fuse: make sure reclaim doesn't write the inode

In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call

- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
/linux-master/fs/zonefs/
H A Dsuper.cdiff 14db5f64 Thu Feb 08 01:26:59 MST 2024 Damien Le Moal <dlemoal@kernel.org> zonefs: Improve error handling

Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:

1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).

2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.

Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
diff 14db5f64 Thu Feb 08 01:26:59 MST 2024 Damien Le Moal <dlemoal@kernel.org> zonefs: Improve error handling

Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:

1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).

2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.

Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
diff 14db5f64 Thu Feb 08 01:26:59 MST 2024 Damien Le Moal <dlemoal@kernel.org> zonefs: Improve error handling

Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:

1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).

2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.

Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
diff 4e458869 Sun Nov 20 03:57:59 MST 2022 Zhang Xiaoxu <zhangxiaoxu5@huawei.com> zonefs: Fix race between modprobe and mount

There is a race between modprobe and mount as below:

modprobe zonefs | mount -t zonefs
--------------------------------|-------------------------
zonefs_init |
register_filesystem [1] |
| zonefs_fill_super [2]
zonefs_sysfs_init [3] |

1. register zonefs suceess, then
2. user can mount the zonefs
3. if sysfs initialize failed, the module initialize failed.

Then the mount process maybe some error happened since the module
initialize failed.

Let's register zonefs after all dependency resource ready. And
reorder the dependency resource release in module exit.

Fixes: 9277a6d4fbd4 ("zonefs: Export open zone resource information through sysfs")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
diff 4e458869 Sun Nov 20 03:57:59 MST 2022 Zhang Xiaoxu <zhangxiaoxu5@huawei.com> zonefs: Fix race between modprobe and mount

There is a race between modprobe and mount as below:

modprobe zonefs | mount -t zonefs
--------------------------------|-------------------------
zonefs_init |
register_filesystem [1] |
| zonefs_fill_super [2]
zonefs_sysfs_init [3] |

1. register zonefs suceess, then
2. user can mount the zonefs
3. if sysfs initialize failed, the module initialize failed.

Then the mount process maybe some error happened since the module
initialize failed.

Let's register zonefs after all dependency resource ready. And
reorder the dependency resource release in module exit.

Fixes: 9277a6d4fbd4 ("zonefs: Export open zone resource information through sysfs")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
diff 2ec810d5 Sun Jun 05 22:55:08 MDT 2022 Matthew Wilcox (Oracle) <willy@infradead.org> mm/migrate: Add filemap_migrate_folio()

There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
diff ff07a02e Thu Jul 14 12:06:27 MDT 2022 Bart Van Assche <bvanassche@acm.org> treewide: Rename enum req_opf into enum req_op

The type name enum req_opf is misleading since it suggests that values of
this type include both an operation type and flags. Since values of this
type represent an operation only, change the type name into enum req_op.

Convert the enum req_op documentation into kernel-doc format. Move a few
definitions such that the enum req_op documentation occurs just above
the enum req_op definition.

The name "req_opf" was introduced by commit ef295ecf090d ("block: better op
and flags encoding").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff 2b95a23c Tue Apr 12 02:00:13 MDT 2022 Damien Le Moal <damien.lemoal@opensource.wdc.com> zonefs: Rename super block information fields

The s_open_zones field of struct zonefs_sb_info is used to count the
number of files that are open for writing and may not necessarilly
correspond to the number of open zones on the device. For instance, an
application may open for writing a sequential zone file, fully write it
and keep the file open. In such case, the zone of the file is not open
anymore (it is in the full state).

Avoid confusion about this counter meaning by renaming it to
s_wro_seq_files. To keep things consistent, the field s_max_open_zones
is renamed to s_max_wro_seq_files.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
diff 2aba0d19 Thu Apr 14 22:52:46 MDT 2022 Christoph Hellwig <hch@lst.de> block: add a bdev_max_zone_append_sectors helper

Add a helper to check the max supported sectors for zone append based on
the block_device instead of having to poke into the block layer internal
request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff c75e707f Fri Mar 04 10:55:56 MST 2022 Christoph Hellwig <hch@lst.de> block: remove the per-bio/request write hint

With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
/linux-master/include/linux/
H A Dfs.hdiff 2824083d Tue Feb 20 22:14:03 MST 2024 Gabriel Krisman Bertazi <krisman@suse.de> ovl: Always reject mounting over case-insensitive directories

overlayfs relies on the filesystem setting DCACHE_OP_HASH or
DCACHE_OP_COMPARE to reject mounting over case-insensitive directories.

Since commit bb9cd9106b22 ("fscrypt: Have filesystems handle their
d_ops"), we set ->d_op through a hook in ->d_lookup, which
means the root dentry won't have them, causing the mount to accidentally
succeed.

In v6.7-rc7, the following sequence will succeed to mount, but any
dentry other than the root dentry will be a "weird" dentry to ovl and
fail with EREMOTE.

mkfs.ext4 -O casefold lower.img
mount -O loop lower.img lower
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl /mnt

Mounting on a subdirectory fails, as expected, because DCACHE_OP_HASH
and DCACHE_OP_COMPARE are properly set by ->lookup.

Fix by explicitly rejecting superblocks that allow case-insensitive
dentries. Yes, this will be solved when we move d_op configuration back
to ->s_d_op. Yet, we better have an explicit fix to avoid messing up
again.

While there, re-sort the entries to have more descriptive error messages
first.

Fixes: bb9cd9106b22 ("fscrypt: Have filesystems handle their d_ops")
Acked-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20240221171412.10710-2-krisman@suse.de
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
diff a4af51ce Tue Feb 06 19:56:15 MST 2024 Kent Overstreet <kent.overstreet@linux.dev> fs: super_set_uuid()

Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.

And add a helper super_set_uuid(), for setting nonstandard length uuids.

Helper is now required for the new FS_IOC_GETUUID ioctl; if
super_set_uuid() hasn't been called, the ioctl won't be supported.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 3058fca1 Fri Feb 02 04:01:31 MST 2024 Amir Goldstein <amir73il@gmail.com> fs: make file_dentry() a simple accessor

file_dentry() is a relic from the days that overlayfs was using files with
a "fake" path, meaning, f_path on overlayfs and f_inode on underlying fs.

In those days, file_dentry() was needed to get the underlying fs dentry
that matches f_inode.

Files with "fake" path should not exist nowadays, so make file_dentry() a
simple accessor and use an assertion to make sure that file_dentry() was
not papering over filesystem bugs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202110132.1584111-2-amir73il@gmail.com
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d3b1a9a7 Fri Feb 02 01:33:04 MST 2024 JonasZhou <JonasZhou@zhaoxin.com> fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In the struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Due to the alignment of struct address_space
variables to 8 bytes, in certain situations, i_mmap and i_mmap_rwsem
may end up in the same CACHE line.

While running Unixbench/execl, we observe high false sharing issues
when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
after i_private_list, ensuring a 64-byte gap between i_mmap and
i_mmap_rwsem.

For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the score
of Unixbench/execl improves by ~3.94%, and the score of Unixbench/shell
improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

with this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou <JonasZhou@zhaoxin.com>
Link: https://lore.kernel.org/r/20240202083304.10995-1-JonasZhou-oc@zhaoxin.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d8f899d1 Wed Jan 24 07:28:55 MST 2024 Baokun Li <libaokun1@huawei.com> fs: make the i_size_read/write helpers be smp_load_acquire/store_release()

In [Link] Linus mentions that acquire/release makes it clear which
_particular_ memory accesses are the ordered ones, and it's unlikely
to make any performance difference, so it's much better to pair up
the release->acquire ordering than have a "wmb->rmb" ordering.

=========================================================
update pagecache
folio_mark_uptodate(folio)
smp_wmb()
set_bit PG_uptodate

=== ↑↑↑ STLR ↑↑↑ === smp_store_release(&inode->i_size, i_size)

folio_test_uptodate(folio)
test_bit PG_uptodate
smp_rmb()

=== ↓↓↓ LDAR ↓↓↓ === smp_load_acquire(&inode->i_size)

copy_page_to_iter()
=========================================================

Calling smp_store_release() in i_size_write() ensures that the data
in the page and the PG_uptodate bit are updated before the isize is
updated, and calling smp_load_acquire() in i_size_read ensures that
it will not read a newer isize than the data in the page. Therefore,
this avoids buffered read-write inconsistencies caused by Load-Load
reordering.

Link: https://lore.kernel.org/r/CAHk-=wifOnmeJq+sn+2s-P46zw0SFEbw9BSCGgp2c5fYPtRPGw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240124142857.4146716-2-libaokun1@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff d8f899d1 Wed Jan 24 07:28:55 MST 2024 Baokun Li <libaokun1@huawei.com> fs: make the i_size_read/write helpers be smp_load_acquire/store_release()

In [Link] Linus mentions that acquire/release makes it clear which
_particular_ memory accesses are the ordered ones, and it's unlikely
to make any performance difference, so it's much better to pair up
the release->acquire ordering than have a "wmb->rmb" ordering.

=========================================================
update pagecache
folio_mark_uptodate(folio)
smp_wmb()
set_bit PG_uptodate

=== ↑↑↑ STLR ↑↑↑ === smp_store_release(&inode->i_size, i_size)

folio_test_uptodate(folio)
test_bit PG_uptodate
smp_rmb()

=== ↓↓↓ LDAR ↓↓↓ === smp_load_acquire(&inode->i_size)

copy_page_to_iter()
=========================================================

Calling smp_store_release() in i_size_write() ensures that the data
in the page and the PG_uptodate bit are updated before the isize is
updated, and calling smp_load_acquire() in i_size_read ensures that
it will not read a newer isize than the data in the page. Therefore,
this avoids buffered read-write inconsistencies caused by Load-Load
reordering.

Link: https://lore.kernel.org/r/CAHk-=wifOnmeJq+sn+2s-P46zw0SFEbw9BSCGgp2c5fYPtRPGw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240124142857.4146716-2-libaokun1@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
/linux-master/fs/nfsd/
H A Dvfs.cdiff 2d8ae8c4 Tue May 02 07:36:02 MDT 2023 Christian Brauner <brauner@kernel.org> nfsd: use vfs setgid helper

We've aligned setgid behavior over multiple kernel releases. The details
can be found in commit cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2' of
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") and
commit 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0' of
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux").
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Usually ATTR_KILL_SGID is
raised in e.g., chown_common() and is subject to the
setattr_should_drop_sgid() check to determine whether the setgid bit can
be retained. Since nfsd is raising ATTR_KILL_SGID unconditionally it
will cause notify_change() to strip it even if the caller had the
necessary privileges to retain it. Ensure that nfsd only raises
ATR_KILL_SGID if the caller lacks the necessary privileges to retain the
setgid bit.

Without this patch the setgid stripping tests in LTP will fail:

> As you can see, the problem is S_ISGID (0002000) was dropped on a
> non-group-executable file while chown was invoked by super-user, while

[...]

> fchown02.c:66: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

[...]

> chown02.c:57: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

With this patch all tests pass.

Reported-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 2d8ae8c4 Tue May 02 07:36:02 MDT 2023 Christian Brauner <brauner@kernel.org> nfsd: use vfs setgid helper

We've aligned setgid behavior over multiple kernel releases. The details
can be found in commit cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2' of
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") and
commit 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0' of
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux").
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Usually ATTR_KILL_SGID is
raised in e.g., chown_common() and is subject to the
setattr_should_drop_sgid() check to determine whether the setgid bit can
be retained. Since nfsd is raising ATTR_KILL_SGID unconditionally it
will cause notify_change() to strip it even if the caller had the
necessary privileges to retain it. Ensure that nfsd only raises
ATR_KILL_SGID if the caller lacks the necessary privileges to retain the
setgid bit.

Without this patch the setgid stripping tests in LTP will fail:

> As you can see, the problem is S_ISGID (0002000) was dropped on a
> non-group-executable file while chown was invoked by super-user, while

[...]

> fchown02.c:66: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

[...]

> chown02.c:57: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

With this patch all tests pass.

Reported-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff de4eda9d Thu Sep 15 18:25:47 MDT 2022 Al Viro <viro@zeniv.linux.org.uk> use less confusing names for iov_iter direction initializers

READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff de4eda9d Thu Sep 15 18:25:47 MDT 2022 Al Viro <viro@zeniv.linux.org.uk> use less confusing names for iov_iter direction initializers

READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 2c445a0e Tue Dec 28 12:26:03 MST 2021 Chuck Lever <chuck.lever@oracle.com> NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)

Since this pointer is used repeatedly, move it to a stack variable.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff ea49dc79 Tue Jul 27 16:56:09 MDT 2021 NeilBrown <neilb@suse.de> NFSD: remove vanity comments

Including one's name in copyright claims is appropriate. Including it
in random comments is just vanity. After 2 decades, it is time for
these to be gone.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 6019ce07 Fri Mar 05 11:57:40 MST 2021 Chuck Lever <chuck.lever@oracle.com> NFSD: Add a tracepoint to record directory entry encoding

Enable watching the progress of directory encoding to capture the
timing of any issues with reading or encoding a directory. The
new tracepoint captures dirent encoding for all NFS versions.

For example, here's what a few NFSv4 directory entries might look
like:

nfsd-989 [002] 468.596265: nfsd_dirent: fh_hash=0x5d162594 ino=2 name=.
nfsd-989 [002] 468.596267: nfsd_dirent: fh_hash=0x5d162594 ino=1 name=..
nfsd-989 [002] 468.596299: nfsd_dirent: fh_hash=0x5d162594 ino=3827 name=zlib.c
nfsd-989 [002] 468.596325: nfsd_dirent: fh_hash=0x5d162594 ino=3811 name=xdiff
nfsd-989 [002] 468.596351: nfsd_dirent: fh_hash=0x5d162594 ino=3810 name=xdiff-interface.h
nfsd-989 [002] 468.596377: nfsd_dirent: fh_hash=0x5d162594 ino=3809 name=xdiff-interface.c

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff c11d7fd1 Tue Jun 23 16:39:24 MDT 2020 Frank van der Linden <fllinden@amazon.com> nfsd: take xattr bits into account for permission checks

Since the NFSv4.2 extended attributes extension defines 3 new access
bits for xattr operations, take them in to account when validating
what the client is asking for, and when checking permissions.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
H A Dnfsproc.cdiff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2a1aa489 Sun Nov 03 09:50:18 MST 2019 Arnd Bergmann <arnd@arndb.de> nfsd: pass a 64-bit guardtime to nfsd_setattr()

Guardtime handling in nfs3 differs between 32-bit and 64-bit
architectures, and uses the deprecated time_t type.

Change it to using time64_t, which behaves the same way on
64-bit and 32-bit architectures, treating the number as an
unsigned 32-bit entity with a range of year 1970 to 2106
consistently, and avoiding the y2038 overflow.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff 83a63072 Mon Aug 26 11:03:11 MDT 2019 Trond Myklebust <trondmy@gmail.com> nfsd: fix nfs read eof detection

Currently, the knfsd server assumes that a short read indicates an
end of file. That assumption is incorrect. The short read means that
either we've hit the end of file, or we've hit a read error.

In the case of a read error, the client may want to retry (as per the
implementation recommendations in RFC1813 and RFC7530), but currently it
is being told that it hit an eof.

Move the code to detect eof from version specific code into the generic
nfsd read.

Report eof only in the two following cases:
1) read() returns a zero length short read with no error.
2) the offset+length of the read is >= the file size.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 2b0143b5 Tue Mar 17 16:25:59 MDT 2015 David Howells <dhowells@redhat.com> VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 04716e66 Thu Aug 07 11:00:20 MDT 2008 J. Bruce Fields <bfields@citi.umich.edu> nfsd: permit unauthenticated stat of export root

RFC 2623 section 2.3.2 permits the server to bypass gss authentication
checks for certain operations that a client may perform when mounting.
In the case of a client that doesn't have some form of credentials
available to it on boot, this allows it to perform the mount unattended.
(Presumably real file access won't be needed until a user with
credentials logs in.)

Being slightly more lenient allows lots of old clients to access
krb5-only exports, with the only loss being a small amount of
information leaked about the root directory of the export.

This affects only v2 and v3; v4 still requires authentication for all
access.

Thanks to Peter Staubach testing against a Solaris client, which
suggesting addition of v3 getattr, to the list, and to Trond for noting
that doing so exposes no additional information.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Peter Staubach <staubach@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
diff c4d987ba Fri Oct 20 00:29:00 MDT 2006 Al Viro <viro@ftp.linux.org.uk> [PATCH] nfsd: NFSv{2,3} trivial endianness annotations for error values

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/linux-master/fs/ubifs/
H A Dfile.cdiff 2ec71843 Wed Jan 24 10:52:50 MST 2024 Matthew Wilcox (Oracle) <willy@infradead.org> ubifs: Convert write_begin_slow() to use a folio

Update to new APIs, removing several calls to compound_head() and
including support for large folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff 2cb1e089 Mon May 22 07:50:15 MDT 2023 David Howells <dhowells@redhat.com> splice: Use filemap_splice_read() instead of generic_file_splice_read()

Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff 66f4742e Tue May 31 21:00:00 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_releasepage: Remove ubifs_assert(0) to valid this process

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

The normal process cannot go to ubifs_releasepage() which means there
exists pages being private but not dirty. Reproducer[1] shows that it
could occur (which maybe related to [2]) with following process:

PA PB PC
lock(page)[PA]
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages[PA]
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage

do_truncation[PB]
truncate_setsize
i_size_write(inode, newsize) // newsize = 0

i_size = i_size_read(inode) // i_size = 0
end_index = i_size >> PAGE_SHIFT
if (page->index > end_index)
goto out // jump
out:
unlock(page) // Private, Not Dirty

generic_fadvise[PC]
lock(page)
invalidate_inode_page
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0)
// bad assertion!
unlock(page)
truncate_pagecache[PB]

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1683): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1513
UBIFS warning (ubi0:0 pid 1683): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1683 Comm: aa Not tainted 5.16.0-rc5-00184-g0bca5994cacc-dirty #308
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x67/0x1d0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
__invalidate_mapping_pages+0xb9/0x280
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
ksys_fadvise64_64+0x4c/0xb0

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215373
[2] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff 66f4742e Tue May 31 21:00:00 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_releasepage: Remove ubifs_assert(0) to valid this process

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

The normal process cannot go to ubifs_releasepage() which means there
exists pages being private but not dirty. Reproducer[1] shows that it
could occur (which maybe related to [2]) with following process:

PA PB PC
lock(page)[PA]
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages[PA]
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage

do_truncation[PB]
truncate_setsize
i_size_write(inode, newsize) // newsize = 0

i_size = i_size_read(inode) // i_size = 0
end_index = i_size >> PAGE_SHIFT
if (page->index > end_index)
goto out // jump
out:
unlock(page) // Private, Not Dirty

generic_fadvise[PC]
lock(page)
invalidate_inode_page
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0)
// bad assertion!
unlock(page)
truncate_pagecache[PB]

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1683): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1513
UBIFS warning (ubi0:0 pid 1683): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1683 Comm: aa Not tainted 5.16.0-rc5-00184-g0bca5994cacc-dirty #308
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x67/0x1d0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
__invalidate_mapping_pages+0xb9/0x280
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
ksys_fadvise64_64+0x4c/0xb0

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215373
[2] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff 66f4742e Tue May 31 21:00:00 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_releasepage: Remove ubifs_assert(0) to valid this process

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

The normal process cannot go to ubifs_releasepage() which means there
exists pages being private but not dirty. Reproducer[1] shows that it
could occur (which maybe related to [2]) with following process:

PA PB PC
lock(page)[PA]
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages[PA]
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage

do_truncation[PB]
truncate_setsize
i_size_write(inode, newsize) // newsize = 0

i_size = i_size_read(inode) // i_size = 0
end_index = i_size >> PAGE_SHIFT
if (page->index > end_index)
goto out // jump
out:
unlock(page) // Private, Not Dirty

generic_fadvise[PC]
lock(page)
invalidate_inode_page
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0)
// bad assertion!
unlock(page)
truncate_pagecache[PB]

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1683): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1513
UBIFS warning (ubi0:0 pid 1683): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1683 Comm: aa Not tainted 5.16.0-rc5-00184-g0bca5994cacc-dirty #308
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x67/0x1d0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
__invalidate_mapping_pages+0xb9/0x280
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
ksys_fadvise64_64+0x4c/0xb0

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215373
[2] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff 66f4742e Tue May 31 21:00:00 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_releasepage: Remove ubifs_assert(0) to valid this process

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

The normal process cannot go to ubifs_releasepage() which means there
exists pages being private but not dirty. Reproducer[1] shows that it
could occur (which maybe related to [2]) with following process:

PA PB PC
lock(page)[PA]
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages[PA]
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage

do_truncation[PB]
truncate_setsize
i_size_write(inode, newsize) // newsize = 0

i_size = i_size_read(inode) // i_size = 0
end_index = i_size >> PAGE_SHIFT
if (page->index > end_index)
goto out // jump
out:
unlock(page) // Private, Not Dirty

generic_fadvise[PC]
lock(page)
invalidate_inode_page
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0)
// bad assertion!
unlock(page)
truncate_pagecache[PB]

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1683): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1513
UBIFS warning (ubi0:0 pid 1683): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1683 Comm: aa Not tainted 5.16.0-rc5-00184-g0bca5994cacc-dirty #308
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x67/0x1d0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
__invalidate_mapping_pages+0xb9/0x280
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
ksys_fadvise64_64+0x4c/0xb0

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215373
[2] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff fb8bc4c7 Tue May 31 20:59:59 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_writepage: Mark page dirty after writing inode failed

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

There is a third possibility which maybe related to [1] that page is
private but not dirty caused by following process:

PA
lock(page)
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage
write_inode
// fail, goto out, following codes are not executed
// do_writepage
// set_page_writeback // set Writeback
// detach_page_private // clear Private
// end_page_writeback // clear Writeback
out:
unlock(page) // Private, Not Dirty

PB
ksys_fadvise64_64
generic_fadvise
invalidate_inode_page
// page is neither Dirty nor Writeback
invalidate_complete_page
// page_has_private is true
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0) !!!

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1492): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1499
UBIFS warning (ubi0:0 pid 1492): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1492 Comm: aa Not tainted 5.16.0-rc2-00012-g7bb767dee0ba-dirty
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x7e/0x1e0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
vfs_fadvise+0x35/0x40
ksys_fadvise64_64+0x4c/0xb0

Jump [2] to find a reproducer.

[1] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215357

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff fb8bc4c7 Tue May 31 20:59:59 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_writepage: Mark page dirty after writing inode failed

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

There is a third possibility which maybe related to [1] that page is
private but not dirty caused by following process:

PA
lock(page)
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage
write_inode
// fail, goto out, following codes are not executed
// do_writepage
// set_page_writeback // set Writeback
// detach_page_private // clear Private
// end_page_writeback // clear Writeback
out:
unlock(page) // Private, Not Dirty

PB
ksys_fadvise64_64
generic_fadvise
invalidate_inode_page
// page is neither Dirty nor Writeback
invalidate_complete_page
// page_has_private is true
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0) !!!

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1492): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1499
UBIFS warning (ubi0:0 pid 1492): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1492 Comm: aa Not tainted 5.16.0-rc2-00012-g7bb767dee0ba-dirty
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x7e/0x1e0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
vfs_fadvise+0x35/0x40
ksys_fadvise64_64+0x4c/0xb0

Jump [2] to find a reproducer.

[1] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215357

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff fb8bc4c7 Tue May 31 20:59:59 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_writepage: Mark page dirty after writing inode failed

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

There is a third possibility which maybe related to [1] that page is
private but not dirty caused by following process:

PA
lock(page)
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage
write_inode
// fail, goto out, following codes are not executed
// do_writepage
// set_page_writeback // set Writeback
// detach_page_private // clear Private
// end_page_writeback // clear Writeback
out:
unlock(page) // Private, Not Dirty

PB
ksys_fadvise64_64
generic_fadvise
invalidate_inode_page
// page is neither Dirty nor Writeback
invalidate_complete_page
// page_has_private is true
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0) !!!

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1492): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1499
UBIFS warning (ubi0:0 pid 1492): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1492 Comm: aa Not tainted 5.16.0-rc2-00012-g7bb767dee0ba-dirty
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x7e/0x1e0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
vfs_fadvise+0x35/0x40
ksys_fadvise64_64+0x4c/0xb0

Jump [2] to find a reproducer.

[1] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215357

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
diff fb8bc4c7 Tue May 31 20:59:59 MDT 2022 Zhihao Cheng <chengzhihao1@huawei.com> ubifs: ubifs_writepage: Mark page dirty after writing inode failed

There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private

There is a third possibility which maybe related to [1] that page is
private but not dirty caused by following process:

PA
lock(page)
ubifs_write_end
attach_page_private // set Private
__set_page_dirty_nobuffers // set Dirty
unlock(page)

write_cache_pages
lock(page)
clear_page_dirty_for_io(page) // clear Dirty
ubifs_writepage
write_inode
// fail, goto out, following codes are not executed
// do_writepage
// set_page_writeback // set Writeback
// detach_page_private // clear Private
// end_page_writeback // clear Writeback
out:
unlock(page) // Private, Not Dirty

PB
ksys_fadvise64_64
generic_fadvise
invalidate_inode_page
// page is neither Dirty nor Writeback
invalidate_complete_page
// page_has_private is true
try_to_release_page
ubifs_releasepage
ubifs_assert(c, 0) !!!

Then we may get following assertion failed:
UBIFS error (ubi0:0 pid 1492): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1499
UBIFS warning (ubi0:0 pid 1492): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1492 Comm: aa Not tainted 5.16.0-rc2-00012-g7bb767dee0ba-dirty
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x7e/0x1e0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
vfs_fadvise+0x35/0x40
ksys_fadvise64_64+0x4c/0xb0

Jump [2] to find a reproducer.

[1] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215357

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
/linux-master/fs/overlayfs/
H A Dcopy_up.cdiff 2c3ef4f8 Mon Nov 20 06:43:34 MST 2023 Amir Goldstein <amir73il@gmail.com> ovl: initialize ovl_copy_up_ctx.destname inside ovl_do_copy_up()

The ->destname member of struct ovl_copy_up_ctx is initialized inside
ovl_copy_up_one() to ->d_name of the overlayfs dentry being copied up
and then it may be overridden by index name inside ovl_do_copy_up().

ovl_inode_lock() in ovl_copy_up_start() and ovl_copy_up() in ovl_rename()
effectively stabilze ->d_name of the overlayfs dentry being copied up,
but ovl_inode_lock() is not held when ->d_name is being read.

It is not a correctness bug, because if ovl_do_copy_up() races with
ovl_rename() and ctx.destname is freed, we will not end up calling
ovl_do_copy_up() with the dead name reference.

The code becomes much easier to understand and to document if the
initialization of c->destname is always done inside ovl_do_copy_up(),
either to the index entry name, or to the overlay dentry ->d_name.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
diff 0f292086 Tue Dec 12 02:44:36 MST 2023 Amir Goldstein <amir73il@gmail.com> splice: return type ssize_t from all helpers

Not sure why some splice helpers return long, maybe historic reasons.
Change them all to return ssize_t to conform to the splice methods and
to the rest of the helpers.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff ca7ab482 Wed Nov 22 05:27:00 MST 2023 Amir Goldstein <amir73il@gmail.com> ovl: add permission hooks outside of do_splice_direct()

The main callers of do_splice_direct() also call rw_verify_area() for
the entire range that is being copied, e.g. by vfs_copy_file_range()
or do_sendfile() before calling do_splice_direct().

The only caller that does not have those checks for entire range is
ovl_copy_up_file(). In preparation for removing the checks inside
do_splice_direct(), add rw_verify_area() call in ovl_copy_up_file().

For extra safety, perform minimal sanity checks from rw_verify_area()
for non negative offsets also in the copy up do_splice_direct() loop
without calling the file permission hooks.

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-2-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 4f11ada1 Tue Jan 24 08:41:18 MST 2023 Miklos Szeredi <mszeredi@redhat.com> ovl: fail on invalid uid/gid mapping at copy up

If st_uid/st_gid doesn't have a mapping in the mounter's user_ns, then
copy-up should fail, just like it would fail if the mounter task was doing
the copy using "cp -a".

There's a corner case where the "cp -a" would succeed but copy up fail: if
there's a mapping of the invalid uid/gid (65534 by default) in the user
namespace. This is because stat(2) will return this value if the mapping
doesn't exist in the current user_ns and "cp -a" will in turn be able to
create a file with this uid/gid.

This behavior would be inconsistent with POSIX ACL's, which return -1 for
invalid uid/gid which result in a failed copy.

For consistency and simplicity fail the copy of the st_uid/st_gid are
invalid.

Fixes: 459c7c565ac3 ("ovl: unprivieged mounts")
Cc: <stable@vger.kernel.org> # v5.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Seth Forshee <sforshee@kernel.org>
diff baabaa50 Tue Jan 24 08:41:18 MST 2023 Miklos Szeredi <mszeredi@redhat.com> ovl: fix tmpfile leak

Missed an error cleanup.

Reported-by: syzbot+fd749a7ea127a84e0ffd@syzkaller.appspotmail.com
Fixes: 2b1a77461f16 ("ovl: use vfs_tmpfile_open() helper")
Cc: <stable@vger.kernel.org> # v6.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2b1a7746 Fri Sep 23 23:00:00 MDT 2022 Miklos Szeredi <mszeredi@redhat.com> ovl: use vfs_tmpfile_open() helper

If tmpfile is used for copy up, then use this helper to create the tmpfile
and open it at the same time. This will later allow filesystems such as
fuse to do this operation atomically.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2d343087 Thu Aug 04 11:11:15 MDT 2022 Al Viro <viro@zeniv.linux.org.uk> overlayfs: constify path

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 5b0a414d Thu Nov 04 07:04:52 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> ovl: fix filattr copy-up failure

This regression can be reproduced with ntfs-3g and overlayfs:

mkdir lower upper work overlay
dd if=/dev/zero of=ntfs.raw bs=1M count=2
mkntfs -F ntfs.raw
mount ntfs.raw lower
touch lower/file.txt
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work - overlay
mv overlay/file.txt overlay/file2.txt

mv fails and (misleadingly) prints

mv: cannot move 'overlay/file.txt' to a subdirectory of itself, 'overlay/file2.txt'

The reason is that ovl_copy_fileattr() is triggered due to S_NOATIME being
set on all inodes (by fuse) regardless of fileattr.

ovl_copy_fileattr() tries to retrieve file attributes from lower file, but
that fails because filesystem does not support this ioctl (this should fail
with ENOTTY, but ntfs-3g return EINVAL instead). This failure is
propagated to origial operation (in this case rename) that triggered the
copy-up.

The fix is to ignore ENOTTY and EINVAL errors from fileattr_get() in copy
up. This also requires turning the internal ENOIOCTLCMD into ENOTTY.

As a further measure to prevent unnecessary failures, only try the
fileattr_get/set on upper if there are any flags to copy up.

Side note: a number of filesystems set S_NOATIME (and sometimes other inode
flags) irrespective of fileattr flags. This causes unnecessary calls
during copy up, which might lead to a performance issue, especially if
latency is high. To fix this, the kernel would need to differentiate
between the two cases. E.g. introduce SB_NOATIME_UPDATE, a per-sb variant
of S_NOATIME. SB_NOATIME doesn't work, because that's interpreted as
"filesystem doesn't store an atime attribute"

Reported-and-tested-by: Kevin Locke <kevin@kevinlocke.name>
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
H A Dsuper.cdiff 2bc5e5e8 Thu Oct 12 09:36:57 MDT 2023 Christian Brauner <brauner@kernel.org> ovl: rely on SB_I_NOUMASK

In commit f61b9bb3f838 ("fs: add a new SB_I_NOUMASK flag") we added a
new SB_I_NOUMASK flag that is used by filesystems like NFS to indicate
that umask stripping is never supposed to be done in the vfs independent
of whether or not POSIX ACLs are supported.

Overlayfs falls into the same category as it raises SB_POSIXACL
unconditionally to defer umask application to the upper filesystem.

Now that we have SB_I_NOUMASK use that and make SB_POSIXACL properly
conditional on whether or not the kernel does have support for it. This
will enable use to turn IS_POSIXACL() into nop on kernels that don't
have POSIX ACL support avoding bugs from missed umask stripping.

Link: https://lore.kernel.org/r/20231012-einband-uferpromenade-80541a047a1f@brauner
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff ceecc2d8 Tue Jun 20 05:42:38 MDT 2023 Christian Brauner <brauner@kernel.org> ovl: reserve ability to reconfigure mount options with new mount api

Using the old mount api to remount an overlayfs superblock via
mount(MS_REMOUNT) all mount options will be silently ignored. For
example, if you create an overlayfs mount:

mount -t overlay overlay -o lowerdir=/mnt/a:/mnt/b,upperdir=/mnt/upper,workdir=/mnt/work /mnt/merged

and then issue a remount via:

# force mount(8) to use mount(2)
export LIBMOUNT_FORCE_MOUNT2=always
mount -t overlay overlay -o remount,WOOTWOOT,lowerdir=/DOESNT-EXIST /mnt/merged

with completely nonsensical mount options whatsoever it will succeed
nonetheless. This prevents us from every changing any mount options we
might introduce in the future that could reasonably be changed during a
remount.

We don't need to carry this issue into the new mount api port. Similar
to FUSE we can use the fs_context::oldapi member to figure out that this
is a request coming through the legacy mount api. If we detect it we
continue silently ignoring all mount options.

But for the new mount api we simply report that mount options cannot
currently be changed. This will allow us to potentially alter mount
properties for new or even old properties. It any case, silently
ignoring everything is not something new apis should do.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
diff b36a5780 Fri Jun 16 07:54:19 MDT 2023 Christian Brauner <brauner@kernel.org> ovl: modify layer parameter parsing

We ran into issues where mount(8) passed multiple lower layers as one
big string through fsconfig(). But the fsconfig() FSCONFIG_SET_STRING
option is limited to 256 bytes in strndup_user(). While this would be
fixable by extending the fsconfig() buffer I'd rather encourage users to
append layers via multiple fsconfig() calls as the interface allows
nicely for this. This has also been requested as a feature before.

With this port to the new mount api the following will be possible:

fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/lower1", 0);

/* set upper layer */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "upperdir", "/upper", 0);

/* append "/lower2", "/lower3", and "/lower4" */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower2:/lower3:/lower4", 0);

/* turn index feature on */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "index", "on", 0);

/* append "/lower5" */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower5", 0);

Specifying ':' would have been rejected so this isn't a regression. And
we can't simply use "lowerdir=/lower" to append on top of existing
layers as "lowerdir=/lower,lowerdir=/other-lower" would make
"/other-lower" the only lower layer so we'd break uapi if we changed
this. So the ':' prefix seems a good compromise.

Users can choose to specify multiple layers at once or individual
layers. A layer is appended if it starts with ":". This requires that
the user has already added at least one layer before. If lowerdir is
specified again without a leading ":" then all previous layers are
dropped and replaced with the new layers. If lowerdir is specified and
empty than all layers are simply dropped.

An additional change is that overlayfs will now parse and resolve layers
right when they are specified in fsconfig() instead of deferring until
super block creation. This allows users to receive early errors.

It also allows users to actually use up to 500 layers something which
was theoretically possible but ended up not working due to the mount
option string passed via mount(2) being too large.

This also allows a more privileged process to set config options for a
lesser privileged process as the creds for fsconfig() and the creds for
fsopen() can differ. We could restrict that they match by enforcing that
the creds of fsopen() and fsconfig() match but I don't see why that
needs to be the case and allows for a good delegation mechanism.

Plus, in the future it means we're able to extend overlayfs mount
options and allow users to specify layers via file descriptors instead
of paths:

fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower1", dirfd);

/* append */
fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower2", dirfd);

/* append */
fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower3", dirfd);

/* clear all layers specified until now */
fsconfig(FSCONFIG_SET_STRING, "lowerdir", NULL, 0);

This would be especially nice if users create an overlayfs mount on top
of idmapped layers or just in general private mounts created via
open_tree(OPEN_TREE_CLONE). Those mounts would then never have to appear
anywhere in the filesystem. But for now just do the minimal thing.

We should probably aim to move more validation into ovl_fs_parse_param()
so users get errors before fsconfig(FSCONFIG_CMD_CREATE). But that can
be done in additional patches later.

This is now also rebased on top of the lazy lowerdata lookup which
allows the specificatin of data only layers using the new "::" syntax.

The rules are simple. A data only layers cannot be followed by any
regular layers and data layers must be preceeded by at least one regular
layer.

Parsing the lowerdir mount option must change because of this. The
original patchset used the old lowerdir parsing function to split a
lowerdir mount option string such as:

lowerdir=/lower1:/lower2::/lower3::/lower4

simply replacing each non-escaped ":" by "\0". So sequences of
non-escaped ":" were counted as layers. For example, the previous
lowerdir mount option above would've counted 6 layers instead of 4 and a
lowerdir mount option such as:

lowerdir="/lower1:/lower2::/lower3::/lower4:::::::::::::::::::::::::::"

would be counted as 33 layers. Other than being ugly this didn't matter
much because kern_path() would reject the first "\0" layer. However,
this overcounting of layers becomes problematic when we base allocations
on it where we very much only want to allocate space for 4 layers
instead of 33.

So the new parsing function rejects non-escaped sequences of colons
other than ":" and "::" immediately instead of relying on kern_path().

Link: https://github.com/util-linux/util-linux/issues/2287
Link: https://github.com/util-linux/util-linux/issues/1992
Link: https://bugs.archlinux.org/task/78702
Link: https://lore.kernel.org/linux-unionfs/20230530-klagen-zudem-32c0908c2108@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
diff 2b21da92 Wed Apr 26 15:21:46 MDT 2023 Amir Goldstein <amir73il@gmail.com> ovl: prepare to store lowerdata redirect for lazy lowerdata lookup

Prepare to allow ovl_lookup() to leave the last entry in a non-dir
lowerstack empty to signify lazy lowerdata lookup.

In this case, ovl_lookup() stores the redirect path from metacopy to
lowerdata in ovl_inode, which is going to be used later to perform the
lazy lowerdata lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2b1a7746 Fri Sep 23 23:00:00 MDT 2022 Miklos Szeredi <mszeredi@redhat.com> ovl: use vfs_tmpfile_open() helper

If tmpfile is used for copy up, then use this helper to create the tmpfile
and open it at the same time. This will later allow filesystems such as
fuse to do this operation atomically.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 2d343087 Thu Aug 04 11:11:15 MDT 2022 Al Viro <viro@zeniv.linux.org.uk> overlayfs: constify path

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 4a47c638 Wed Jul 06 07:56:11 MDT 2022 Christian Brauner <brauner@kernel.org> ovl: turn of SB_POSIXACL with idmapped layers temporarily

This cycle we added support for mounting overlayfs on top of idmapped
mounts. Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
mkdir -pv /vol/contpool/ctrover/{over,work}
chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
/var/lib/lxc/c2/rootfs \
/vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

mount -t overlay overlay \
-o lowerdir=/vol/contpool/lowermap, \
upperdir=/vol/contpool/overmap/over, \
workdir=/vol/contpool/overmap/work \
/vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

# file: vol/contpool/merge/home/ubuntu/.bashrc
# owner: 1000
# group: 1000
user::rw-
user:4294967295:rwx
group::r--
mask::rwx
other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container. IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() ________________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() _______________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
|-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { |
| V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(&init_user_ns, 10000004);
| } |_________________________________________________
| |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
}

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(0(&init_user_ns, 10000004);
| |_________________________________________________
| } |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
}

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

ovl_permission()
-> generic_permission()
-> acl_permission_check()
-> check_acl()
-> get_acl()
-> inode->i_op->get_acl() == ovl_get_acl()
> get_acl() /* on the underlying filesystem)
->inode->i_op->get_acl() == /*lower filesystem callback */
-> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a497c ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 4a47c638 Wed Jul 06 07:56:11 MDT 2022 Christian Brauner <brauner@kernel.org> ovl: turn of SB_POSIXACL with idmapped layers temporarily

This cycle we added support for mounting overlayfs on top of idmapped
mounts. Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
mkdir -pv /vol/contpool/ctrover/{over,work}
chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
/var/lib/lxc/c2/rootfs \
/vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

mount -t overlay overlay \
-o lowerdir=/vol/contpool/lowermap, \
upperdir=/vol/contpool/overmap/over, \
workdir=/vol/contpool/overmap/work \
/vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

# file: vol/contpool/merge/home/ubuntu/.bashrc
# owner: 1000
# group: 1000
user::rw-
user:4294967295:rwx
group::r--
mask::rwx
other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container. IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() ________________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() _______________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
|-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { |
| V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(&init_user_ns, 10000004);
| } |_________________________________________________
| |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
}

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(0(&init_user_ns, 10000004);
| |_________________________________________________
| } |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
}

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

ovl_permission()
-> generic_permission()
-> acl_permission_check()
-> check_acl()
-> get_acl()
-> inode->i_op->get_acl() == ovl_get_acl()
> get_acl() /* on the underlying filesystem)
->inode->i_op->get_acl() == /*lower filesystem callback */
-> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a497c ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff 4a47c638 Wed Jul 06 07:56:11 MDT 2022 Christian Brauner <brauner@kernel.org> ovl: turn of SB_POSIXACL with idmapped layers temporarily

This cycle we added support for mounting overlayfs on top of idmapped
mounts. Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
mkdir -pv /vol/contpool/ctrover/{over,work}
chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
/var/lib/lxc/c2/rootfs \
/vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

mount -t overlay overlay \
-o lowerdir=/vol/contpool/lowermap, \
upperdir=/vol/contpool/overmap/over, \
workdir=/vol/contpool/overmap/work \
/vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

# file: vol/contpool/merge/home/ubuntu/.bashrc
# owner: 1000
# group: 1000
user::rw-
user:4294967295:rwx
group::r--
mask::rwx
other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container. IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() ________________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() _______________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
|-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { |
| V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(&init_user_ns, 10000004);
| } |_________________________________________________
| |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
}

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(0(&init_user_ns, 10000004);
| |_________________________________________________
| } |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
}

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

ovl_permission()
-> generic_permission()
-> acl_permission_check()
-> check_acl()
-> get_acl()
-> inode->i_op->get_acl() == ovl_get_acl()
> get_acl() /* on the underlying filesystem)
->inode->i_op->get_acl() == /*lower filesystem callback */
-> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a497c ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
diff bb49e9e7 Thu Dec 02 16:16:58 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fs: add is_idmapped_mnt() helper

Multiple places open-code the same check to determine whether a given
mount is idmapped. Introduce a simple helper function that can be used
instead. This allows us to get rid of the fragile open-coding. We will
later change the check that is used to determine whether a given mount
is idmapped. Introducing a helper allows us to do this in a single
place instead of doing it for multiple places.

Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff bb49e9e7 Thu Dec 02 16:16:58 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fs: add is_idmapped_mnt() helper

Multiple places open-code the same check to determine whether a given
mount is idmapped. Introduce a simple helper function that can be used
instead. This allows us to get rid of the fragile open-coding. We will
later change the check that is used to determine whether a given mount
is idmapped. Introducing a helper allows us to do this in a single
place instead of doing it for multiple places.

Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff bb49e9e7 Thu Dec 02 16:16:58 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fs: add is_idmapped_mnt() helper

Multiple places open-code the same check to determine whether a given
mount is idmapped. Introduce a simple helper function that can be used
instead. This allows us to get rid of the fragile open-coding. We will
later change the check that is used to determine whether a given mount
is idmapped. Introducing a helper allows us to do this in a single
place instead of doing it for multiple places.

Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
/linux-master/fs/f2fs/
H A Dfile.cdiff 2f6d721e Tue Mar 05 20:47:46 MST 2024 Xiuhong Wang <xiuhong.wang@unisoc.com> f2fs: compress: fix reserve_cblocks counting error when out of space

When a file only needs one direct_node, performing the following
operations will cause the file to be unrepairable:

unisoc # ./f2fs_io compress test.apk
unisoc #df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.2M 100% /data

unisoc # ./f2fs_io release_cblocks test.apk
924
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 4.8M 100% /data

unisoc # dd if=/dev/random of=file4 bs=1M count=3
3145728 bytes (3.0 M) copied, 0.025 s, 120 M/s
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.8M 100% /data

unisoc # ./f2fs_io reserve_cblocks test.apk
F2FS_IOC_RESERVE_COMPRESS_BLOCKS failed: No space left on device

adb reboot
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 11M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
0

This is because the file has only one direct_node. After returning
to -ENOSPC, reserved_blocks += ret will not be executed. As a result,
the reserved_blocks at this time is still 0, which is not the real
number of reserved blocks. Therefore, fsck cannot be set to repair
the file.

After this patch, the fsck flag will be set to fix this problem.

unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.8M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
F2FS_IOC_RESERVE_COMPRESS_BLOCKS failed: No space left on device

adb reboot then fsck will be executed
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 11M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
924

Fixes: c75488fb4d82 ("f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS")
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 2fc2bcc8 Sun Feb 25 20:19:16 MST 2024 Chao Yu <chao@kernel.org> f2fs: fix to check return value in f2fs_insert_range()

In f2fs_insert_range(), it missed to check return value of
filemap_write_and_wait_range(), fix it.

Meanwhile, just return error number once __exchange_data_block()
fails.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4961acdd Sun Dec 10 04:35:42 MST 2023 Chao Yu <chao@kernel.org> f2fs: fix to tag gcing flag on page during block migration

It needs to add missing gcing flag on page during block migration,
in order to garantee migrated data be persisted during checkpoint,
otherwise out-of-order persistency between data and node may cause
data corruption after SPOR.

Similar issue was fixed by commit 2d1fe8a86bf5 ("f2fs: fix to tag
gcing flag on page during file defragment").

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 982c3b30 Tue Oct 24 07:01:08 MDT 2023 Christian Brauner <brauner@kernel.org> bdev: rename freeze and thaw helpers

We have bdev_mark_dead() etc and we're going to move block device
freezing to holder ops in the next patch. Make the naming consistent:

* freeze_bdev() -> bdev_freeze()
* thaw_bdev() -> bdev_thaw()

Also document the return code.

Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
diff 4ed33e69 Tue Oct 03 16:51:49 MDT 2023 Jaegeuk Kim <jaegeuk@kernel.org> f2fs: stop iterating f2fs_map_block if hole exists

Let's avoid unnecessary f2fs_map_block calls to load extents.

# f2fs_io fadvise willneed 0 4096 /data/local/tmp/test

f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 386, start blkaddr = 0x34ac00, len = 0x1400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 5506, start blkaddr = 0x34c200, len = 0x1000, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 9602, start blkaddr = 0x34d600, len = 0x1200, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 14210, start blkaddr = 0x34ec00, len = 0x400, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 15235, start blkaddr = 0x34f401, len = 0xbff, flags = 2,
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 18306, start blkaddr = 0x350200, len = 0x1200, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 22915, start blkaddr = 0x351601, len = 0xa7d, flags = 2
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25600, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25601, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 25602, start blkaddr = 0x351601, len = 0x0, flags = 0
...
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1037188, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1038206, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 1039224, start blkaddr = 0x351601, len = 0x0, flags = 0
f2fs_map_blocks: dev = (254,51), ino = 85845, file offset = 2075548, start blkaddr = 0x351601, len = 0x0, flags = 0

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
/linux-master/fs/
H A Dinode.cdiff 8b3d8381 Tue Jan 16 00:53:35 MST 2024 Baolin Wang <baolin.wang@linux.alibaba.com> fs: improve dump_mapping() robustness

We met a kernel crash issue when running stress-ng testing, and the
system crashes when printing the dentry name in dump_mapping().

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......

The root cause is that, one thread is doing page migration, and we will
use the target page's ->mapping field to save 'anon_vma' pointer between
page unmap and page move, and now the target page is locked and refcount
is 1.

Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers
that the refcount of this target page is 1, preventing the offline operation,
thus proceeding to dump the page. However, page_mapping() of the target
page may return an incorrect file mapping to crash the system in dump_mapping(),
since the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.

The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate:
fix getting incorrect page mapping during page migration"). In addition,
Matthew suggested we should also improve dump_mapping()'s robustness to
resilient against the kernel crash [1].

With checking the 'dentry.parent' and 'dentry.d_name.name' used by
dentry_name(), I can see dump_mapping() will output the invalid dentry
instead of crashing the system when this issue is reproduced again.

[12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07
[12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870
[12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff)
[12211.189150] page_type: 0xffffffff()
[12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0
[12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000
[12211.189155] page dumped because: unmovable page

[1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.1705391420.git.baolin.wang@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 8b3d8381 Tue Jan 16 00:53:35 MST 2024 Baolin Wang <baolin.wang@linux.alibaba.com> fs: improve dump_mapping() robustness

We met a kernel crash issue when running stress-ng testing, and the
system crashes when printing the dentry name in dump_mapping().

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......

The root cause is that, one thread is doing page migration, and we will
use the target page's ->mapping field to save 'anon_vma' pointer between
page unmap and page move, and now the target page is locked and refcount
is 1.

Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers
that the refcount of this target page is 1, preventing the offline operation,
thus proceeding to dump the page. However, page_mapping() of the target
page may return an incorrect file mapping to crash the system in dump_mapping(),
since the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.

The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate:
fix getting incorrect page mapping during page migration"). In addition,
Matthew suggested we should also improve dump_mapping()'s robustness to
resilient against the kernel crash [1].

With checking the 'dentry.parent' and 'dentry.d_name.name' used by
dentry_name(), I can see dump_mapping() will output the invalid dentry
instead of crashing the system when this issue is reproduced again.

[12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07
[12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870
[12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff)
[12211.189150] page_type: 0xffffffff()
[12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0
[12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000
[12211.189155] page dumped because: unmovable page

[1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.1705391420.git.baolin.wang@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 9d5b9475 Mon Nov 20 16:35:12 MST 2023 Joel Granados <j.granados@samsung.com> fs: Remove the now superfluous sentinel elements from ctl_table array

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel elements ctl_table struct. Special attention was placed in
making sure that an empty directory for fs/verity was created when
CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the
register sysctl call that expects a size.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
diff 0a97c01c Thu Nov 30 12:40:18 MST 2023 Nhat Pham <nphamcs@gmail.com> list_lru: allow explicit memcg and NUMA node selection

Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:

https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg. While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 0a97c01c Thu Nov 30 12:40:18 MST 2023 Nhat Pham <nphamcs@gmail.com> list_lru: allow explicit memcg and NUMA node selection

Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:

https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg. While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 762321da Wed Oct 25 08:10:17 MDT 2023 Christoph Hellwig <hch@lst.de> filemap: add a per-mapping stable writes flag

folio_wait_stable waits for writeback to finish before modifying the
contents of a folio again, e.g. to support check summing of the data
in the block integrity code.

Currently this behavior is controlled by the SB_I_STABLE_WRITES flag
on the super_block, which means it is uniform for the entire file system.
This is wrong for the block device pseudofs which is shared by all
block devices, or file systems that can use multiple devices like XFS
witht the RT subvolume or btrfs (although btrfs currently reimplements
folio_wait_stable anyway).

Add a per-address_space AS_STABLE_WRITES flag to control the behavior
in a more fine grained way. The existing SB_I_STABLE_WRITES is kept
to initialize AS_STABLE_WRITES to the existing default which covers
most cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231025141020.192413-2-hch@lst.de
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 16a94965 Wed Oct 04 12:52:38 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: convert core infrastructure to new timestamp accessors

Convert the core vfs code to use the new timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185239.80830-2-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 3e15dcf7 Fri Sep 08 07:28:59 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: rename __mnt_{want,drop}_write*() helpers

Before exporting these helpers to modules, make their names more
meaningful.

The names mnt_{get,put)_write_access*() were chosen, because they rhyme
with the inode {get,put)_write_access() helpers, which have a very close
meaning for the inode object.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230908132900.2983519-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 9b6304c1 Wed Jul 05 12:58:10 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: add ctime accessors infrastructure

struct timespec64 has unused bits in the tv_nsec field that can be used
for other purposes. In future patches, we're going to change how the
inode->i_ctime is accessed in certain inodes in order to make use of
them. In order to do that safely though, we'll need to eradicate raw
accesses of the inode->i_ctime field from the kernel.

Add new accessor functions for the ctime that we use to replace them.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Message-Id: <20230705185812.579118-2-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 33ab231f Mon Jul 03 08:49:12 MDT 2023 Christian Brauner <brauner@kernel.org> fs: don't assume arguments are non-NULL

The helper is explicitly documented as locking zero, one, or two
arguments. While all current callers do pass non-NULL arguments there's
no need or requirement for them to do so according to the code and the
unlock_two_nondirectories() helper is pretty clear about it as well. So
only call WARN_ON_ONCE() if the checked inode is valid.

Fixes: 2454ad83b90a ("fs: Restrict lock_two_nondirectories() to non-directory inodes")
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Message-Id: <20230703-vfs-rename-source-v1-2-37eebb29b65b@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
H A Dlibfs.cdiff ae62bcb5 Mon Oct 23 08:30:49 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: report f_fsid from s_dev for "simple" filesystems

There are many "simple" filesystems (*) that report null f_fsid in
statfs(2). Those "simple" filesystems report sb->s_dev as the st_dev
field of the stat syscalls for all inodes of the filesystem (**).

In order to enable fanotify reporting of events with fsid on those
"simple" filesystems, report the sb->s_dev number in f_fsid field of
statfs(2).

(*) For most of the "simple" filesystem refered to in this commit, the
->statfs() operation is simple_statfs(). Some of those fs assign the
simple_statfs() method directly in their ->s_op struct and some assign it
indirectly via a call to simple_fill_super() or to pseudo_fs_fill_super()
with either custom or "simple" s_op.
We also make the same change to efivarfs and hugetlbfs, although they do
not use simple_statfs(), because they use the simple_* inode opreations
(e.g. simple_lookup()).

(**) For most of the "simple" filesystems, the ->getattr() method is not
assigned, so stat() is implemented by generic_fillattr(). A few "simple"
filesystem use the simple_getattr() method which also calls
generic_fillattr() to fill most of the stat struct.

The two exceptions are procfs and 9p. procfs implements several different
->getattr() methods, but they all end up calling generic_fillattr() to
fill the st_dev field from sb->s_dev.

9p has more complicated ->getattr() methods, but they too, end up calling
generic_fillattr() to fill the st_dev field from sb->s_dev.

Note that 9p and kernfs also call simple_statfs() from custom ->statfs()
methods which already fill the f_fsid field, but v9fs_statfs() calls
simple_statfs() only in case f_fsid was not filled and kenrfs_statfs()
overwrites f_fsid after calling simple_statfs().

Link: https://lore.kernel.org/r/20230919094820.g5bwharbmy2dq46w@quack3/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023143049.2944970-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff ae62bcb5 Mon Oct 23 08:30:49 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: report f_fsid from s_dev for "simple" filesystems

There are many "simple" filesystems (*) that report null f_fsid in
statfs(2). Those "simple" filesystems report sb->s_dev as the st_dev
field of the stat syscalls for all inodes of the filesystem (**).

In order to enable fanotify reporting of events with fsid on those
"simple" filesystems, report the sb->s_dev number in f_fsid field of
statfs(2).

(*) For most of the "simple" filesystem refered to in this commit, the
->statfs() operation is simple_statfs(). Some of those fs assign the
simple_statfs() method directly in their ->s_op struct and some assign it
indirectly via a call to simple_fill_super() or to pseudo_fs_fill_super()
with either custom or "simple" s_op.
We also make the same change to efivarfs and hugetlbfs, although they do
not use simple_statfs(), because they use the simple_* inode opreations
(e.g. simple_lookup()).

(**) For most of the "simple" filesystems, the ->getattr() method is not
assigned, so stat() is implemented by generic_fillattr(). A few "simple"
filesystem use the simple_getattr() method which also calls
generic_fillattr() to fill most of the stat struct.

The two exceptions are procfs and 9p. procfs implements several different
->getattr() methods, but they all end up calling generic_fillattr() to
fill the st_dev field from sb->s_dev.

9p has more complicated ->getattr() methods, but they too, end up calling
generic_fillattr() to fill the st_dev field from sb->s_dev.

Note that 9p and kernfs also call simple_statfs() from custom ->statfs()
methods which already fill the f_fsid field, but v9fs_statfs() calls
simple_statfs() only in case f_fsid was not filled and kenrfs_statfs()
overwrites f_fsid after calling simple_statfs().

Link: https://lore.kernel.org/r/20230919094820.g5bwharbmy2dq46w@quack3/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023143049.2944970-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 8287474a Wed Sep 13 10:28:15 MDT 2023 Al Viro <viro@zeniv.linux.org.uk> direct_write_fallback(): on error revert the ->ki_pos update from buffered write

If we fail filemap_write_and_wait_range() on the range the buffered write went
into, we only report the "number of bytes which we direct-written", to quote
the comment in there. Which is fine, but buffered write has already advanced
iocb->ki_pos, so we need to roll that back. Otherwise we end up with e.g.
write(2) advancing position by more than the amount it reports having written.

Fixes: 182c25e9c157 "filemap: update ki_pos in generic_perform_write"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Message-Id: <20230827214518.GU3390869@ZenIV>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2be4f05a Tue Jul 25 12:31:04 MDT 2023 Chuck Lever <chuck.lever@oracle.com> libfs: Remove parent dentry locking in offset_iterate_dir()

Since offset_iterate_dir() does not walk the parent's d_subdir list
nor does it manipulate the parent's d_child, there doesn't seem to
be a reason to hold the parent's d_lock. The offset_ctx's xarray can
be sufficiently protected with just the RCU read lock.

Flame graph data captured during the git regression run shows a
20% reduction in CPU cycles consumed in offset_find_next().

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202307171640.e299f8d5-oliver.sang@intel.com
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Message-Id: <169030957098.157536.9938425508695693348.stgit@manet.1015granger.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2e41f274 Mon Sep 19 11:24:16 MDT 2022 Akinobu Mita <akinobu.mita@gmail.com> libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value

Patch series "fix error when writing negative value to simple attribute
files".

The simple attribute files do not accept a negative value since the commit
488dac0c9237 ("libfs: fix error cast of negative value in
simple_attr_write()"), but some attribute files want to accept a negative
value.


This patch (of 3):

The simple attribute files do not accept a negative value since the commit
488dac0c9237 ("libfs: fix error cast of negative value in
simple_attr_write()"), so we have to use a 64-bit value to write a
negative value.

This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value.

Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com
Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com
Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2e41f274 Mon Sep 19 11:24:16 MDT 2022 Akinobu Mita <akinobu.mita@gmail.com> libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value

Patch series "fix error when writing negative value to simple attribute
files".

The simple attribute files do not accept a negative value since the commit
488dac0c9237 ("libfs: fix error cast of negative value in
simple_attr_write()"), but some attribute files want to accept a negative
value.


This patch (of 3):

The simple attribute files do not accept a negative value since the commit
488dac0c9237 ("libfs: fix error cast of negative value in
simple_attr_write()"), so we have to use a 64-bit value to write a
negative value.

This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value.

Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com
Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com
Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 5298d4bf Mon Jan 17 23:56:14 MST 2022 Christoph Hellwig <hch@lst.de> unicode: clean up the Kconfig symbol confusion

Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d04787012 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
diff 6429e463 Thu Oct 28 03:47:21 MDT 2021 Lorenz Bauer <lmb@cloudflare.com> libfs: Move shmem_exchange to simple_rename_exchange

Move shmem_exchange and make it available to other callers.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20211028094724.59043-2-lmb@cloudflare.com
diff 59347d99 Mon Feb 15 21:29:27 MST 2021 Randy Dunlap <rdunlap@infradead.org> libfs: fix kernel-doc for mnt_userns

Fix kernel-doc warning in libfs.c.

../fs/libfs.c:498: warning: Function parameter or member 'mnt_userns' not described in 'simple_setattr'

Link: https://lore.kernel.org/r/20210216042929.8931-2-rdunlap@infradead.org/
Fixes: 549c7297717c ("fs: make helpers idmap mount aware")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
H A Dopen.cdiff 5f0d594c Fri Feb 02 05:17:23 MST 2024 Tony Solomonik <tony.solomonik@gmail.com> Add do_ftruncate that truncates a struct file

do_sys_ftruncate receives a file descriptor, fgets the struct file, and
finally actually truncates the file.

do_ftruncate allows for passing in a file directly, with the caller
already holding a reference to it.

Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff 83bc1d29 Mon Oct 09 09:37:10 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: get mnt_writers count for an open backing file's real path

A writeable mapped backing file can perform writes to the real inode.
Therefore, the real path mount must be kept writable so long as the
writable map exists.

This may not be strictly needed for ovelrayfs private upper mount,
but it is correct to take the mnt_writers count in the vfs helper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 3e15dcf7 Fri Sep 08 07:28:59 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: rename __mnt_{want,drop}_write*() helpers

Before exporting these helpers to modules, make their names more
meaningful.

The names mnt_{get,put)_write_access*() were chosen, because they rhyme
with the inode {get,put)_write_access() helpers, which have a very close
meaning for the inode object.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230908132900.2983519-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 021a160a Tue Aug 08 11:26:35 MDT 2023 Linus Torvalds <torvalds@linux-foundation.org> fs: use __fput_sync in close(2)

close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.

Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).

Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)

Patched result is the same regardless of rseq as the codepath is avoided.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 021a160a Tue Aug 08 11:26:35 MDT 2023 Linus Torvalds <torvalds@linux-foundation.org> fs: use __fput_sync in close(2)

close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.

Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).

Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)

Patched result is the same regardless of rseq as the codepath is avoided.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 021a160a Tue Aug 08 11:26:35 MDT 2023 Linus Torvalds <torvalds@linux-foundation.org> fs: use __fput_sync in close(2)

close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.

Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).

Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)

Patched result is the same regardless of rseq as the codepath is avoided.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5daeb41a Fri Jul 28 05:58:26 MDT 2023 Aleksa Sarai <cyphar@cyphar.com> fchmodat2: add support for AT_EMPTY_PATH

This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.

The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):

> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.

However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
H A Dattr.cdiff 16a94965 Wed Oct 04 12:52:38 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: convert core infrastructure to new timestamp accessors

Convert the core vfs code to use the new timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185239.80830-2-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5d1f903f Wed Jul 12 12:58:49 MDT 2023 Christian Brauner <brauner@kernel.org> attr: block mode changes of symlinks

Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.

However, the vfs doesn't block mode changes on symlinks. This however,
has lead to an untenable mess roughly classifiable into the following
two categories:

(1) Filesystems that don't implement a i_op->setattr() for symlinks.

Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.

That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.

Example: afs

(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.

Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.

The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via

if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.

When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.

Perversely, this has the consequences that this behavior may depend
on two kconfig options and mount options:

* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl

Example: btrfs, ext4, xfs

The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:

fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);

char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);

But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.

So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.

If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.

But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.

We could block this in chmod_common() but it's more appropriate to do it
notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidently.

Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5d1f903f Wed Jul 12 12:58:49 MDT 2023 Christian Brauner <brauner@kernel.org> attr: block mode changes of symlinks

Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.

However, the vfs doesn't block mode changes on symlinks. This however,
has lead to an untenable mess roughly classifiable into the following
two categories:

(1) Filesystems that don't implement a i_op->setattr() for symlinks.

Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.

That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.

Example: afs

(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.

Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.

The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via

if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.

When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.

Perversely, this has the consequences that this behavior may depend
on two kconfig options and mount options:

* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl

Example: btrfs, ext4, xfs

The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:

fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);

char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);

But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.

So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.

If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.

But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.

We could block this in chmod_common() but it's more appropriate to do it
notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidently.

Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 5d1f903f Wed Jul 12 12:58:49 MDT 2023 Christian Brauner <brauner@kernel.org> attr: block mode changes of symlinks

Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.

However, the vfs doesn't block mode changes on symlinks. This however,
has lead to an untenable mess roughly classifiable into the following
two categories:

(1) Filesystems that don't implement a i_op->setattr() for symlinks.

Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.

That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.

Example: afs

(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.

Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.

The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via

if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.

When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.

Perversely, this has the consequences that this behavior may depend
on two kconfig options and mount options:

* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl

Example: btrfs, ext4, xfs

The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:

fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);

char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);

But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.

So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.

If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.

But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.

We could block this in chmod_common() but it's more appropriate to do it
notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidently.

Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 4f704d9a Mon Mar 13 17:51:10 MDT 2023 Christian Brauner <brauner@kernel.org> nfs: use vfs setgid helper

We've aligned setgid behavior over multiple kernel releases. The details
can be found in the following two merge messages:
cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2')
426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0')
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Switch nfs to rely on this
helper as well. Without this patch the setgid stripping tests in
xfstests will fail.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff e2ebff9c Mon Aug 08 02:52:35 MDT 2022 David Howells <dhowells@redhat.com> vfs: Check the truncate maximum size in inode_newsize_ok()

If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.

Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:

(1) Cachefiles will round up the EOF of the backing file to DIO block
size so as to allow DIO on the final block - but this might push
the offset negative. It then calls notify_change(), but this
inadvertently bypasses the checking. This can be triggered if
someone puts an 8EiB-1 file on a server for someone else to try and
access by, say, nfs.

(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
and then calls vfs_truncate() directly - which also bypasses the
check.

In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.

nfsd is okay as it checks the value, but we can then remove this check
too.

Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.

Fixes: 1f08c925e7a3 ("cachefiles: Implement backing file wrangling")
Fixes: f44158485826 ("cifsd: add file operations")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: stable@kernel.org
Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 168f9128 Mon Jun 13 05:15:17 MDT 2022 Christian Brauner <brauner@kernel.org> fs: account for group membership

When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.

When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.

Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.

The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.

When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.

We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.

Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.

New regression test sent to xfstests.

Link: https://github.com/lxc/lxd/issues/10537
Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # 5.15+
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
diff 96821970 Tue Nov 09 07:57:12 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> fs: handle circular mappings correctly

When calling setattr_prepare() to determine the validity of the attributes the
ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id.
When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id
matches the current {g,u}id attribute the attribute change is allowed.

The value in ia_{g,u}id does already account for idmapped mounts and will have
taken the relevant idmapping into account. So in order to verify that the
{g,u}id attribute isn't changed we simple need to compare the ia_{g,u}id value
against the inode's i_{g,u}id value.

This only has any meaning for idmapped mounts as idmapping helpers are
idempotent without them. And for idmapped mounts this really only has a meaning
when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped
to id 1001 and id 1001 is mapped to id 1000. Such ciruclar mappings can e.g. be
useful when sharing the same home directory between multiple users at the same
time.

As an example consider a directory with two files: /source/file1 owned by
{g,u}id 1000 and /source/file2 owned by {g,u}id 1001. Assume we create an
idmapped mount at /target with an idmapping that maps files owned by {g,u}id
1000 to being owned by {g,u}id 1001 and files owned by {g,u}id 1001 to being
owned by {g,u}id 1000. In effect, the idmapped mount at /target switches the
ownership of /source/file1 and source/file2, i.e. /target/file1 will be owned
by {g,u}id 1001 and /target/file2 will be owned by {g,u}id 1000.

This means that a user with fs{g,u}id 1000 must be allowed to setattr
/target/file2 from {g,u}id 1000 to {g,u}id 1000. Similar, a user with fs{g,u}id
1001 must be allowed to setattr /target/file1 from {g,u}id 1001 to {g,u}id
1001. Conversely, a user with fs{g,u}id 1000 must fail to setattr /target/file1
from {g,u}id 1001 to {g,u}id 1000. And a user with fs{g,u}id 1001 must fail to
setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Both cases must fail
with EPERM for non-capable callers.

Before this patch we could end up denying legitimate attribute changes and
allowing invalid attribute changes when circular mappings are used. To even get
into this situation the caller must've been privileged both to create that
mapping and to create that idmapped mount.

This hasn't been seen in the wild anywhere but came up when expanding the
testsuite during work on a series of hardening patches. All idmapped fstests
pass without any regressions and we add new tests to verify the behavior of
circular mappings.

Link: https://lore.kernel.org/r/20211109145713.1868404-1-brauner@kernel.org
Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts")
Cc: Seth Forshee <seth.forshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
/linux-master/fs/ext4/
H A Dinode.cdiff 3fcc2b88 Fri Jan 26 18:58:00 MST 2024 Zhang Yi <yi.zhang@huawei.com> ext4: refactor ext4_da_map_blocks()

Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary
parameters and branches, no logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2ffd2a6a Fri Jan 05 02:21:01 MST 2024 Kemeng Shi <shikemeng@huaweicloud.com> ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations

The "needed" controls the number of ext4_prealloc_space to discard in
ext4_discard_preallocations. Function ext4_discard_preallocations is
supposed to discard all non-used preallocated blocks when "needed"
is 0 and now ext4_discard_preallocations is always called with "needed"
= 0. Remove unnecessary parameter "needed" and remove all non-used
preallocated spaces in ext4_discard_preallocations to simplify the
code.

Note: If count of non-used preallocated spaces could be more than
UINT_MAX, there was a memory leak as some non-used preallocated
spaces are left ununsed and this commit will fix it. Otherwise,
there is no behavior change.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure

** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().

** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

* The write was extending i_size.
* The last block of the file was fallocated and had an unwritten extent
* We were near ENOSPC and hence we were switching to non-delayed alloc
allocation.

In this case, the back trace that triggers the bug is as follows:

ext4_da_write_begin()
/* switch to nodelalloc due to low space */
ext4_write_begin()
ext4_should_dioread_nolock() // true since mount flags still have delalloc
__block_write_begin(..., ext4_get_block_unwritten)
__block_write_begin_int()
for(each buffer head in page) {
/* first iteration, this is bh1 which contains i_size */
if (!buffer_mapped)
get_block() /* returns bh with only UNWRITTEN and MAPPED */
/* second iteration, bh2 */
if (!buffer_mapped)
get_block() /* we fail here, could be ENOSPC */
}
if (err)
/*
* this would zero out all new buffers and mark them uptodate.
* Since bh1 was never marked new, we skip it here which causes
* the bug later.
*/
folio_zero_new_buffers();
/* ext4_wrte_begin() error handling */
ext4_truncate_failed_write()
ext4_truncate()
ext4_block_truncate_page()
__ext4_block_zero_page_range()
if(!buffer_uptodate())
ext4_read_bh_lock()
ext4_read_bh() -> ... ext4_submit_bh_wbc()
BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reproducer:

fallocate -l 4k testfile
xfs_io -c "pwrite 2k 2k" testfile
# hexdump output will have stale data in from byte 0 to 2k in testfile
hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write happening via
ext4_write_begin(), which is usually used when we have -o nodealloc. Since
dioread_nolock is disabled with nodelalloc, the only alternate way to call
ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
This might make the issue harder to hit. Hence, for reliable replication, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and then
mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
if (ext4_should_journal_data(inode))
return 0;
/* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ // return 0;
return 1;
}

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure

** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().

** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

* The write was extending i_size.
* The last block of the file was fallocated and had an unwritten extent
* We were near ENOSPC and hence we were switching to non-delayed alloc
allocation.

In this case, the back trace that triggers the bug is as follows:

ext4_da_write_begin()
/* switch to nodelalloc due to low space */
ext4_write_begin()
ext4_should_dioread_nolock() // true since mount flags still have delalloc
__block_write_begin(..., ext4_get_block_unwritten)
__block_write_begin_int()
for(each buffer head in page) {
/* first iteration, this is bh1 which contains i_size */
if (!buffer_mapped)
get_block() /* returns bh with only UNWRITTEN and MAPPED */
/* second iteration, bh2 */
if (!buffer_mapped)
get_block() /* we fail here, could be ENOSPC */
}
if (err)
/*
* this would zero out all new buffers and mark them uptodate.
* Since bh1 was never marked new, we skip it here which causes
* the bug later.
*/
folio_zero_new_buffers();
/* ext4_wrte_begin() error handling */
ext4_truncate_failed_write()
ext4_truncate()
ext4_block_truncate_page()
__ext4_block_zero_page_range()
if(!buffer_uptodate())
ext4_read_bh_lock()
ext4_read_bh() -> ... ext4_submit_bh_wbc()
BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reproducer:

fallocate -l 4k testfile
xfs_io -c "pwrite 2k 2k" testfile
# hexdump output will have stale data in from byte 0 to 2k in testfile
hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write happening via
ext4_write_begin(), which is usually used when we have -o nodealloc. Since
dioread_nolock is disabled with nodelalloc, the only alternate way to call
ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
This might make the issue harder to hit. Hence, for reliable replication, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and then
mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
if (ext4_should_journal_data(inode))
return 0;
/* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ // return 0;
return 1;
}

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure

** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().

** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

* The write was extending i_size.
* The last block of the file was fallocated and had an unwritten extent
* We were near ENOSPC and hence we were switching to non-delayed alloc
allocation.

In this case, the back trace that triggers the bug is as follows:

ext4_da_write_begin()
/* switch to nodelalloc due to low space */
ext4_write_begin()
ext4_should_dioread_nolock() // true since mount flags still have delalloc
__block_write_begin(..., ext4_get_block_unwritten)
__block_write_begin_int()
for(each buffer head in page) {
/* first iteration, this is bh1 which contains i_size */
if (!buffer_mapped)
get_block() /* returns bh with only UNWRITTEN and MAPPED */
/* second iteration, bh2 */
if (!buffer_mapped)
get_block() /* we fail here, could be ENOSPC */
}
if (err)
/*
* this would zero out all new buffers and mark them uptodate.
* Since bh1 was never marked new, we skip it here which causes
* the bug later.
*/
folio_zero_new_buffers();
/* ext4_wrte_begin() error handling */
ext4_truncate_failed_write()
ext4_truncate()
ext4_block_truncate_page()
__ext4_block_zero_page_range()
if(!buffer_uptodate())
ext4_read_bh_lock()
ext4_read_bh() -> ... ext4_submit_bh_wbc()
BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reproducer:

fallocate -l 4k testfile
xfs_io -c "pwrite 2k 2k" testfile
# hexdump output will have stale data in from byte 0 to 2k in testfile
hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write happening via
ext4_write_begin(), which is usually used when we have -o nodealloc. Since
dioread_nolock is disabled with nodelalloc, the only alternate way to call
ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
This might make the issue harder to hit. Hence, for reliable replication, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and then
mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
if (ext4_should_journal_data(inode))
return 0;
/* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ // return 0;
return 1;
}

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure

** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().

** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

* The write was extending i_size.
* The last block of the file was fallocated and had an unwritten extent
* We were near ENOSPC and hence we were switching to non-delayed alloc
allocation.

In this case, the back trace that triggers the bug is as follows:

ext4_da_write_begin()
/* switch to nodelalloc due to low space */
ext4_write_begin()
ext4_should_dioread_nolock() // true since mount flags still have delalloc
__block_write_begin(..., ext4_get_block_unwritten)
__block_write_begin_int()
for(each buffer head in page) {
/* first iteration, this is bh1 which contains i_size */
if (!buffer_mapped)
get_block() /* returns bh with only UNWRITTEN and MAPPED */
/* second iteration, bh2 */
if (!buffer_mapped)
get_block() /* we fail here, could be ENOSPC */
}
if (err)
/*
* this would zero out all new buffers and mark them uptodate.
* Since bh1 was never marked new, we skip it here which causes
* the bug later.
*/
folio_zero_new_buffers();
/* ext4_wrte_begin() error handling */
ext4_truncate_failed_write()
ext4_truncate()
ext4_block_truncate_page()
__ext4_block_zero_page_range()
if(!buffer_uptodate())
ext4_read_bh_lock()
ext4_read_bh() -> ... ext4_submit_bh_wbc()
BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reproducer:

fallocate -l 4k testfile
xfs_io -c "pwrite 2k 2k" testfile
# hexdump output will have stale data in from byte 0 to 2k in testfile
hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write happening via
ext4_write_begin(), which is usually used when we have -o nodealloc. Since
dioread_nolock is disabled with nodelalloc, the only alternate way to call
ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
This might make the issue harder to hit. Hence, for reliable replication, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and then
mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
if (ext4_should_journal_data(inode))
return 0;
/* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ // return 0;
return 1;
}

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 2cd8bdb5 Mon Sep 18 04:45:50 MDT 2023 Ojaswin Mujoo <ojaswin@linux.ibm.com> ext4: mark buffer new if it is unwritten to avoid stale data exposure

** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().

** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

* The write was extending i_size.
* The last block of the file was fallocated and had an unwritten extent
* We were near ENOSPC and hence we were switching to non-delayed alloc
allocation.

In this case, the back trace that triggers the bug is as follows:

ext4_da_write_begin()
/* switch to nodelalloc due to low space */
ext4_write_begin()
ext4_should_dioread_nolock() // true since mount flags still have delalloc
__block_write_begin(..., ext4_get_block_unwritten)
__block_write_begin_int()
for(each buffer head in page) {
/* first iteration, this is bh1 which contains i_size */
if (!buffer_mapped)
get_block() /* returns bh with only UNWRITTEN and MAPPED */
/* second iteration, bh2 */
if (!buffer_mapped)
get_block() /* we fail here, could be ENOSPC */
}
if (err)
/*
* this would zero out all new buffers and mark them uptodate.
* Since bh1 was never marked new, we skip it here which causes
* the bug later.
*/
folio_zero_new_buffers();
/* ext4_wrte_begin() error handling */
ext4_truncate_failed_write()
ext4_truncate()
ext4_block_truncate_page()
__ext4_block_zero_page_range()
if(!buffer_uptodate())
ext4_read_bh_lock()
ext4_read_bh() -> ... ext4_submit_bh_wbc()
BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reproducer:

fallocate -l 4k testfile
xfs_io -c "pwrite 2k 2k" testfile
# hexdump output will have stale data in from byte 0 to 2k in testfile
hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write happening via
ext4_write_begin(), which is usually used when we have -o nodealloc. Since
dioread_nolock is disabled with nodelalloc, the only alternate way to call
ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
This might make the issue harder to hit. Hence, for reliable replication, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and then
mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
if (ext4_should_journal_data(inode))
return 0;
/* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ // return 0;
return 1;
}

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 8216776c Mon Aug 14 12:29:01 MDT 2023 Eric Biggers <ebiggers@google.com> ext4: reject casefold inode flag without casefold feature

It is invalid for the casefold inode flag to be set without the casefold
superblock feature flag also being set. e2fsck already considers this
case to be invalid and handles it by offering to clear the casefold flag
on the inode. __ext4_iget() also already considered this to be invalid,
sort of, but it only got so far as logging an error message; it didn't
actually reject the inode. Make it reject the inode so that other code
doesn't have to handle this case. This matches what f2fs does.

Note: we could check 's_encoding != NULL' instead of
ext4_has_feature_casefold(). This would make the check robust against
the casefold feature being enabled by userspace writing to the page
cache of the mounted block device. However, it's unsolvable in general
for filesystems to be robust against concurrent writes to the page cache
of the mounted block device. Though this very particular scenario
involving the casefold feature is solvable, we should not pretend that
we can support this model, so let's just check the casefold feature.
tune2fs already forbids enabling casefold on a mounted filesystem.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 03de20be Thu Aug 10 09:43:33 MDT 2023 Liu Song <liusong@linux.alibaba.com> ext4: do not mark inode dirty every time when appending using delalloc

In the delalloc append write scenario, if inode's i_size is extended due
to buffer write, there are delalloc writes pending in the range up to
i_size, and no need to touch i_disksize since writeback will push
i_disksize up to i_size eventually. Offers significant performance
improvement in high-frequency append write scenarios.

I conducted tests in my 32-core environment by launching 32 concurrent
threads to append write to the same file. Each write operation had a
length of 1024 bytes and was repeated 100000 times. Without using this
patch, the test was completed in 7705 ms. However, with this patch, the
test was completed in 5066 ms, resulting in a performance improvement of
34%.

Moreover, in test scenarios of Kafka version 2.6.2, using packet size of
2K, with this patch resulted in a 10% performance improvement.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
diff 03de20be Thu Aug 10 09:43:33 MDT 2023 Liu Song <liusong@linux.alibaba.com> ext4: do not mark inode dirty every time when appending using delalloc

In the delalloc append write scenario, if inode's i_size is extended due
to buffer write, there are delalloc writes pending in the range up to
i_size, and no need to touch i_disksize since writeback will push
i_disksize up to i_size eventually. Offers significant performance
improvement in high-frequency append write scenarios.

I conducted tests in my 32-core environment by launching 32 concurrent
threads to append write to the same file. Each write operation had a
length of 1024 bytes and was repeated 100000 times. Without using this
patch, the test was completed in 7705 ms. However, with this patch, the
test was completed in 5066 ms, resulting in a performance improvement of
34%.

Moreover, in test scenarios of Kafka version 2.6.2, using packet size of
2K, with this patch resulted in a 10% performance improvement.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
/linux-master/fs/xfs/
H A Dxfs_iops.cdiff e014f37d Tue Mar 08 11:51:16 MST 2022 Darrick J. Wong <djwong@kernel.org> xfs: use setattr_copy to set vfs inode attributes

Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
revocation isn't consistent with btrfs[1] or ext4. Those two
filesystems use the VFS function setattr_copy to convey certain
attributes from struct iattr into the VFS inode structure.

Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
decide if it should clear setgid and setuid on a file attribute update.
This is a second symptom of the problem that Filipe noticed.

XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
xfs_setattr_nonsize, and xfs_setattr_time. Regrettably, setattr_copy is
/not/ a simple copy function; it contains additional logic to clear the
setgid bit when setting the mode, and XFS' version no longer matches.

The VFS implements its own setuid/setgid stripping logic, which
establishes consistent behavior. It's a tad unfortunate that it's
scattered across notify_change, should_remove_suid, and setattr_copy but
XFS should really follow the Linux VFS. Adapt XFS to use the VFS
functions and get rid of the old functions.

[1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/

Fixes: 7fa294c8991c ("userns: Allow chown and setgid preservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
diff e014f37d Tue Mar 08 11:51:16 MST 2022 Darrick J. Wong <djwong@kernel.org> xfs: use setattr_copy to set vfs inode attributes

Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
revocation isn't consistent with btrfs[1] or ext4. Those two
filesystems use the VFS function setattr_copy to convey certain
attributes from struct iattr into the VFS inode structure.

Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
decide if it should clear setgid and setuid on a file attribute update.
This is a second symptom of the problem that Filipe noticed.

XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
xfs_setattr_nonsize, and xfs_setattr_time. Regrettably, setattr_copy is
/not/ a simple copy function; it contains additional logic to clear the
setgid bit when setting the mode, and XFS' version no longer matches.

The VFS implements its own setuid/setgid stripping logic, which
establishes consistent behavior. It's a tad unfortunate that it's
scattered across notify_change, should_remove_suid, and setattr_copy but
XFS should really follow the Linux VFS. Adapt XFS to use the VFS
functions and get rid of the old functions.

[1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/

Fixes: 7fa294c8991c ("userns: Allow chown and setgid preservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
diff 2e973b2c Wed Aug 18 19:46:52 MDT 2021 Dave Chinner <dchinner@redhat.com> xfs: convert remaining mount flags to state flags

The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.

Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 38c26bfd Wed Aug 18 19:46:37 MDT 2021 Dave Chinner <dchinner@redhat.com> xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like
this:

for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done

With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.

The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:

$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 1b9598c8 Fri Mar 01 09:14:57 MST 2019 Luis R. Rodriguez <mcgrof@kernel.org> xfs: fix reporting supported extra file attributes for statx()

statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3d42d ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.

$ git describe --contains 5f955f26f3d42d04aba65590a32eb70eedb7f37d
v4.11-rc6~5^2^2~2

Fixes: 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 1b9598c8 Fri Mar 01 09:14:57 MST 2019 Luis R. Rodriguez <mcgrof@kernel.org> xfs: fix reporting supported extra file attributes for statx()

statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3d42d ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.

$ git describe --contains 5f955f26f3d42d04aba65590a32eb70eedb7f37d
v4.11-rc6~5^2^2~2

Fixes: 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 1b9598c8 Fri Mar 01 09:14:57 MST 2019 Luis R. Rodriguez <mcgrof@kernel.org> xfs: fix reporting supported extra file attributes for statx()

statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3d42d ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.

$ git describe --contains 5f955f26f3d42d04aba65590a32eb70eedb7f37d
v4.11-rc6~5^2^2~2

Fixes: 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff 1b9598c8 Fri Mar 01 09:14:57 MST 2019 Luis R. Rodriguez <mcgrof@kernel.org> xfs: fix reporting supported extra file attributes for statx()

statx(2) notes that any attribute that is not indicated as supported by
stx_attributes_mask has no usable value. Commit 5f955f26f3d42d ("xfs: report
crtime and attribute flags to statx") added support for informing userspace
of extra file attributes but forgot to list these flags as supported
making reporting them rather useless for the pedantic userspace author.

$ git describe --contains 5f955f26f3d42d04aba65590a32eb70eedb7f37d
v4.11-rc6~5^2^2~2

Fixes: 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx")
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment reminding people to keep attributes_mask up to date]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
/linux-master/fs/ocfs2/
H A Dfile.cdiff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2cb1e089 Mon May 22 07:50:15 MDT 2023 David Howells <dhowells@redhat.com> splice: Use filemap_splice_read() instead of generic_file_splice_read()

Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff 839b6386 Fri Nov 05 14:34:55 MDT 2021 Jan Kara <jack@suse.cz> ocfs2: fix data corruption on truncate

Patch series "ocfs2: Truncate data corruption fix".

As further testing has shown, commit 5314454ea3f ("ocfs2: fix data
corruption after conversion from inline format") didn't fix all the data
corruption issues the customer started observing after 6dbf7bb55598
("fs: Don't invalidate page buffers in block_write_full_page()") This
time I have tracked them down to two bugs in ocfs2 truncation code.

One bug (truncating page cache before clearing tail cluster and setting
i_size) could cause data corruption even before 6dbf7bb55598, but before
that commit it needed a race with page fault, after 6dbf7bb55598 it
started to be pretty deterministic.

Another bug (zeroing pages beyond old i_size) used to be harmless
inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598
in combination with the first bug it resulted in deterministic data
corruption.

Although fixing only the first problem is needed to stop data
corruption, I've fixed both issues to make the code more robust.

This patch (of 2):

ocfs2_truncate_file() did unmap invalidate page cache pages before
zeroing partial tail cluster and setting i_size. Thus some pages could
be left (and likely have left if the cluster zeroing happened) in the
page cache beyond i_size after truncate finished letting user possibly
see stale data once the file was extended again. Also the tail cluster
zeroing was not guaranteed to finish before truncate finished causing
possible stale data exposure. The problem started to be particularly
easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers
in block_write_full_page()" stopped invalidation of pages beyond i_size
from page writeback path.

Fix these problems by unmapping and invalidating pages in the page cache
after the i_size is reduced and tail cluster is zeroed out.

Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 9449ad33 Thu Jul 29 15:53:41 MDT 2021 Junxiao Bi <junxiao.bi@oracle.com> ocfs2: issue zeroout to EOF blocks

For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster. But since ->writepage will ignore EOF
pages, those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't. The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again. That made
writeback ignore them forever. That will cause data corruption. Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more. The fix is to avoid zero EOF blocks with buffer
write.

The following code snippet from qemu-img could trigger the corruption.

656 open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
660 fallocate(11, 0, 2275868672, 327680) = 0
658 pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 9449ad33 Thu Jul 29 15:53:41 MDT 2021 Junxiao Bi <junxiao.bi@oracle.com> ocfs2: issue zeroout to EOF blocks

For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster. But since ->writepage will ignore EOF
pages, those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't. The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again. That made
writeback ignore them forever. That will cause data corruption. Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more. The fix is to avoid zero EOF blocks with buffer
write.

The following code snippet from qemu-img could trigger the corruption.

656 open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
660 fallocate(11, 0, 2275868672, 327680) = 0
658 pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2b5f52c5 Wed Apr 07 06:36:44 MDT 2021 Miklos Szeredi <mszeredi@redhat.com> ocfs2: convert to fileattr

Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Joel Becker <jlbec@evilplan.org>
diff 90bd070a Fri Apr 09 14:27:29 MDT 2021 Wengang Wang <wen.gang.wang@oracle.com> ocfs2: fix deadlock between setattr and dio_end_io_write

The following deadlock is detected:

truncate -> setattr path is waiting for pending direct IO to be done (inode->i_dio_count become zero) with inode->i_rwsem held (down_write).

PID: 14827 TASK: ffff881686a9af80 CPU: 20 COMMAND: "ora_p005_hrltd9"
#0 __schedule at ffffffff818667cc
#1 schedule at ffffffff81866de6
#2 inode_dio_wait at ffffffff812a2d04
#3 ocfs2_setattr at ffffffffc05f322e [ocfs2]
#4 notify_change at ffffffff812a5a09
#5 do_truncate at ffffffff812808f5
#6 do_sys_ftruncate.constprop.18 at ffffffff81280cf2
#7 sys_ftruncate at ffffffff81280d8e
#8 do_syscall_64 at ffffffff81003949
#9 entry_SYSCALL_64_after_hwframe at ffffffff81a001ad

dio completion path is going to complete one direct IO (decrement
inode->i_dio_count), but before that it hung at locking inode->i_rwsem:

#0 __schedule+700 at ffffffff818667cc
#1 schedule+54 at ffffffff81866de6
#2 rwsem_down_write_failed+536 at ffffffff8186aa28
#3 call_rwsem_down_write_failed+23 at ffffffff8185a1b7
#4 down_write+45 at ffffffff81869c9d
#5 ocfs2_dio_end_io_write+180 at ffffffffc05d5444 [ocfs2]
#6 ocfs2_dio_end_io+85 at ffffffffc05d5a85 [ocfs2]
#7 dio_complete+140 at ffffffff812c873c
#8 dio_aio_complete_work+25 at ffffffff812c89f9
#9 process_one_work+361 at ffffffff810b1889
#10 worker_thread+77 at ffffffff810b233d
#11 kthread+261 at ffffffff810b7fd5
#12 ret_from_fork+62 at ffffffff81a0035e

Thus above forms ABBA deadlock. The same deadlock was mentioned in
upstream commit 28f5a8a7c033 ("ocfs2: should wait dio before inode lock
in ocfs2_setattr()"). It seems that that commit only removed the
cluster lock (the victim of above dead lock) from the ABBA deadlock
party.

End-user visible effects: Process hang in truncate -> ocfs2_setattr path
and other processes hang at ocfs2_dio_end_io_write path.

This is to fix the deadlock itself. It removes inode_lock() call from
dio completion path to remove the deadlock and add ip_alloc_sem lock in
setattr path to synchronize the inode modifications.

[wen.gang.wang@oracle.com: remove the "had_alloc_lock" as suggested]
Link: https://lkml.kernel.org/r/20210402171344.1605-1-wen.gang.wang@oracle.com

Link: https://lkml.kernel.org/r/20210331203654.3911-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 90bd070a Fri Apr 09 14:27:29 MDT 2021 Wengang Wang <wen.gang.wang@oracle.com> ocfs2: fix deadlock between setattr and dio_end_io_write

The following deadlock is detected:

truncate -> setattr path is waiting for pending direct IO to be done (inode->i_dio_count become zero) with inode->i_rwsem held (down_write).

PID: 14827 TASK: ffff881686a9af80 CPU: 20 COMMAND: "ora_p005_hrltd9"
#0 __schedule at ffffffff818667cc
#1 schedule at ffffffff81866de6
#2 inode_dio_wait at ffffffff812a2d04
#3 ocfs2_setattr at ffffffffc05f322e [ocfs2]
#4 notify_change at ffffffff812a5a09
#5 do_truncate at ffffffff812808f5
#6 do_sys_ftruncate.constprop.18 at ffffffff81280cf2
#7 sys_ftruncate at ffffffff81280d8e
#8 do_syscall_64 at ffffffff81003949
#9 entry_SYSCALL_64_after_hwframe at ffffffff81a001ad

dio completion path is going to complete one direct IO (decrement
inode->i_dio_count), but before that it hung at locking inode->i_rwsem:

#0 __schedule+700 at ffffffff818667cc
#1 schedule+54 at ffffffff81866de6
#2 rwsem_down_write_failed+536 at ffffffff8186aa28
#3 call_rwsem_down_write_failed+23 at ffffffff8185a1b7
#4 down_write+45 at ffffffff81869c9d
#5 ocfs2_dio_end_io_write+180 at ffffffffc05d5444 [ocfs2]
#6 ocfs2_dio_end_io+85 at ffffffffc05d5a85 [ocfs2]
#7 dio_complete+140 at ffffffff812c873c
#8 dio_aio_complete_work+25 at ffffffff812c89f9
#9 process_one_work+361 at ffffffff810b1889
#10 worker_thread+77 at ffffffff810b233d
#11 kthread+261 at ffffffff810b7fd5
#12 ret_from_fork+62 at ffffffff81a0035e

Thus above forms ABBA deadlock. The same deadlock was mentioned in
upstream commit 28f5a8a7c033 ("ocfs2: should wait dio before inode lock
in ocfs2_setattr()"). It seems that that commit only removed the
cluster lock (the victim of above dead lock) from the ABBA deadlock
party.

End-user visible effects: Process hang in truncate -> ocfs2_setattr path
and other processes hang at ocfs2_dio_end_io_write path.

This is to fix the deadlock itself. It removes inode_lock() call from
dio completion path to remove the deadlock and add ip_alloc_sem lock in
setattr path to synchronize the inode modifications.

[wen.gang.wang@oracle.com: remove the "had_alloc_lock" as suggested]
Link: https://lkml.kernel.org/r/20210402171344.1605-1-wen.gang.wang@oracle.com

Link: https://lkml.kernel.org/r/20210331203654.3911-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
/linux-master/fs/nilfs2/
H A Dinode.cdiff 21a87d88 Sat Oct 01 09:08:04 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()

If the i_mode field in inode of metadata files is corrupted on disk, it
can cause the initialization of bmap structure, which should have been
called from nilfs_read_inode_common(), not to be called. This causes a
lockdep warning followed by a NULL pointer dereference at
nilfs_bmap_lookup_at_level().

This patch fixes these issues by adding a missing sanitiy check for the
i_mode field of metadata file's inode.

Link: https://lkml.kernel.org/r/20221002030804.29978-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+2b32eb36c1a825b7a74c@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff e897be17 Fri Apr 01 12:28:18 MDT 2022 Ryusuke Konishi <konishi.ryusuke@gmail.com> nilfs2: fix lockdep warnings in page operations for btree nodes

Patch series "nilfs2 lockdep warning fixes".

The first two are to resolve the lockdep warning issue, and the last one
is the accompanying cleanup and low priority.

Based on your comment, this series solves the issue by separating inode
object as needed. Since I was worried about the impact of the object
composition changes, I tested the series carefully not to cause
regressions especially for delicate functions such like disk space
reclamation and snapshots.

This patch (of 3):

If CONFIG_LOCKDEP is enabled, nilfs2 hits lockdep warnings at
inode_to_wb() during page/folio operations for btree nodes:

WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 inode_to_wb include/linux/backing-dev.h:269 [inline]
WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 folio_account_dirtied mm/page-writeback.c:2460 [inline]
WARNING: CPU: 0 PID: 6575 at include/linux/backing-dev.h:269 __folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509
Modules linked in:
...
RIP: 0010:inode_to_wb include/linux/backing-dev.h:269 [inline]
RIP: 0010:folio_account_dirtied mm/page-writeback.c:2460 [inline]
RIP: 0010:__folio_mark_dirty+0xa7c/0xe30 mm/page-writeback.c:2509
...
Call Trace:
__set_page_dirty include/linux/pagemap.h:834 [inline]
mark_buffer_dirty+0x4e6/0x650 fs/buffer.c:1145
nilfs_btree_propagate_p fs/nilfs2/btree.c:1889 [inline]
nilfs_btree_propagate+0x4ae/0xea0 fs/nilfs2/btree.c:2085
nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337
nilfs_collect_dat_data+0x45/0xd0 fs/nilfs2/segment.c:625
nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1009
nilfs_segctor_scan_file+0x47a/0x700 fs/nilfs2/segment.c:1048
nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1224 [inline]
nilfs_segctor_collect fs/nilfs2/segment.c:1494 [inline]
nilfs_segctor_do_construct+0x14f3/0x6c60 fs/nilfs2/segment.c:2036
nilfs_segctor_construct+0x7a7/0xb30 fs/nilfs2/segment.c:2372
nilfs_segctor_thread_construct fs/nilfs2/segment.c:2480 [inline]
nilfs_segctor_thread+0x3c3/0xf90 fs/nilfs2/segment.c:2563
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

This is because nilfs2 uses two page caches for each inode and
inode->i_mapping never points to one of them, the btree node cache.

This causes inode_to_wb(inode) to refer to a different page cache than
the caller page/folio operations such like __folio_start_writeback(),
__folio_end_writeback(), or __folio_mark_dirty() acquired the lock.

This patch resolves the issue by allocating and using an additional
inode to hold the page cache of btree nodes. The inode is attached
one-to-one to the traditional nilfs2 inode if it requires a block
mapping with b-tree. This setup change is in memory only and does not
affect the disk format.

Link: https://lkml.kernel.org/r/1647867427-30498-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1647867427-30498-2-git-send-email-konishi.ryusuke@gmail.com
Link: https://lore.kernel.org/r/YXrYvIo8YRnAOJCj@casper.infradead.org
Link: https://lore.kernel.org/r/9a20b33d-b38f-b4a2-4742-c1eb5b8e4d6c@redhat.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+0d5b462a6f07447991b3@syzkaller.appspotmail.com
Reported-by: syzbot+34ef28bb2aeb28724aa0@syzkaller.appspotmail.com
Reported-by: Hao Sun <sunhao.th@gmail.com>
Reported-by: David Hildenbrand <david@redhat.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 1b0e3186 Tue Aug 11 19:35:43 MDT 2020 Eric Biggers <ebiggers@google.com> nilfs2: only call unlock_new_inode() if I_NEW

Patch series "nilfs2 updates".

This patch (of 3):

unlock_new_inode() is only meant to be called after a new inode has
already been inserted into the hash table. But nilfs_new_inode() can call
it even before it has inserted the inode, triggering the WARNING in
unlock_new_inode(). Fix this by only calling unlock_new_inode() if the
inode has the I_NEW flag set, indicating that it's in the table.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1595860111-3920-1-git-send-email-konishi.ryusuke@gmail.com
Link: http://lkml.kernel.org/r/1595860111-3920-2-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 06f4abf6 Mon May 23 17:23:31 MDT 2016 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: do not emit extra newline on nilfs_warning() and nilfs_error()

This updates call sites of nilfs_warning() and nilfs_error() so that they
don't add a duplicate newline. These output functions are already
designed to add a trailing newline to the message.

Link: http://lkml.kernel.org/r/1462886671-3521-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2d19961d Mon May 23 17:23:20 MDT 2016 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: move cleanup code of metadata file from inode routines

Refactor nilfs_clear_inode() and nilfs_i_callback() so that cleanup
code or resource deallocation related to metadata file will be moved
out to mdt.c.

Link: http://lkml.kernel.org/r/1461935747-10380-9-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2b0143b5 Tue Mar 17 16:25:59 MDT 2015 David Howells <dhowells@redhat.com> VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 136e8770 Fri May 24 16:55:29 MDT 2013 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary

nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
"The machine I've been trialling nilfs on is running Debian Testing,
Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
also reproduced it (identically) with Debian Unstable amd64 and Debian
Experimental (using the 3.8-trunk kernel). The problematic partitions
were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

[63102.496756] nilfs_direct_assign: invalid pointer: 0
[63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
[63102.496798]
[63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes. As
it is possible to see from above output, the file has 60 bytes of the
whole size. So, it is enough one block (1 KB in size) allocation for
the whole file. Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called. However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/fs/ext2/
H A Dinode.cdiff 2ebc736c Thu Aug 17 13:59:25 MDT 2023 Georg Ottinger <g.ottinger@gmx.at> ext2: improve consistency of ext2_fsblk_t datatype usage

The ext2 block allocation/deallocation functions and their respective
calls use a mixture of unsigned long and ext2_fsblk_t datatypes to
index the desired ext2 block. This commit replaces occurrences of
unsigned long with ext2_fsblk_t, covering the functions
ext2_new_block(), ext2_new_blocks(), ext2_free_blocks(),
ext2_free_data() and ext2_free_branches(). This commit is rather
conservative, and only replaces unsigned long with ext2_fsblk_t if
the variable is used to index a specific ext2 block.

Signed-off-by: Georg Ottinger <g.ottinger@gmx.at>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230817195925.10268-1-g.ottinger@gmx.at>
diff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff fcced95b Fri Apr 21 03:46:11 MDT 2023 Ritesh Harjani (IBM) <ritesh.list@gmail.com> ext2/dax: Fix ext2_setsize when len is page aligned

PAGE_ALIGN(x) macro gives the next highest value which is multiple of
pagesize. But if x is already page aligned then it simply returns x.
So, if x passed is 0 in dax_zero_range() function, that means the
length gets passed as 0 to ->iomap_begin().

In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on
here in ext2_get_blocks().
BUG_ON(maxblocks == 0);

Instead we should be calling dax_truncate_page() here which takes
care of it. i.e. it only calls dax_zero_range if the offset is not
page/block aligned.

This can be easily triggered with following on fsdax mounted pmem
device.

dd if=/dev/zero of=file count=1 bs=512
truncate -s 0 file

[79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff)
[93.793207] ------------[ cut here ]------------
[93.795102] kernel BUG at fs/ext2/inode.c:637!
[93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139
[93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610
<...>
[93.835298] Call Trace:
[93.836253] <TASK>
[93.837103] ? lock_acquire+0xf8/0x110
[93.838479] ? d_lookup+0x69/0xd0
[93.839779] ext2_iomap_begin+0xa7/0x1c0
[93.841154] iomap_iter+0xc7/0x150
[93.842425] dax_zero_range+0x6e/0xa0
[93.843813] ext2_setsize+0x176/0x1b0
[93.845164] ext2_setattr+0x151/0x200
[93.846467] notify_change+0x341/0x4e0
[93.847805] ? lock_acquire+0xf8/0x110
[93.849143] ? do_truncate+0x74/0xe0
[93.850452] ? do_truncate+0x84/0xe0
[93.851739] do_truncate+0x84/0xe0
[93.852974] do_sys_ftruncate+0x2b4/0x2f0
[93.854404] do_syscall_64+0x3f/0x90
[93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc

CC: stable@vger.kernel.org
Fixes: 2aa3048e03d3 ("iomap: switch iomap_zero_range to use iomap_iter")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <046a58317f29d9603d1068b2bbae47c2332c17ae.1682069716.git.ritesh.list@gmail.com>
diff 2aab03b8 Sun Jun 05 06:55:09 MDT 2022 Xiang wangx <wangxiang@cdjrlc.com> fs: Fix syntax errors in comments

Delete the redundant word 'not'.

Link: https://lore.kernel.org/r/20220605125509.14837-1-wangxiang@cdjrlc.com
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
diff 9583db23 Tue Jul 20 07:33:38 MDT 2021 Christoph Hellwig <hch@lst.de> ext2: make ext2_iomap_ops available unconditionally

ext2_iomap_ops will be used for the FIEMAP support going forward,
so make it available unconditionally.

Link: https://lore.kernel.org/r/20210720133341.405438-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 5aa1437d Thu May 17 15:18:30 MDT 2018 Al Viro <viro@zeniv.linux.org.uk> ext2: fix a block leak

open file, unlink it, then use ioctl(2) to make it immutable or
append only. Now close it and watch the blocks *not* freed...

Immutable/append-only checks belong in ->setattr().
Note: the bug is old and backport to anything prior to 737f2e93b972
("ext2: convert to use the new truncate convention") will need
these checks lifted into ext2_setattr().

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/linux-master/fs/ocfs2/dlmfs/
H A Ddlmfs.cdiff e22c3872 Thu Aug 31 13:38:27 MDT 2023 Azeem Shaikh <azeemshaikh38@gmail.com> fs: ocfs2: replace strlcpy with sysfs_emit

strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1]. In an effort
to remove strlcpy() completely [2], replace strlcpy() here with
sysfs_emit().

Direct replacement is safe here since its ok for `kernel_param_ops.get()`
to return -errno [3].

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
[3] https://elixir.bootlin.com/linux/v6.5/source/include/linux/moduleparam.h#L52

Link: https://lkml.kernel.org/r/20230831193827.1528867-1-azeemshaikh38@gmail.com
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff e22c3872 Thu Aug 31 13:38:27 MDT 2023 Azeem Shaikh <azeemshaikh38@gmail.com> fs: ocfs2: replace strlcpy with sysfs_emit

strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1]. In an effort
to remove strlcpy() completely [2], replace strlcpy() here with
sysfs_emit().

Direct replacement is safe here since its ok for `kernel_param_ops.get()`
to return -errno [3].

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
[3] https://elixir.bootlin.com/linux/v6.5/source/include/linux/moduleparam.h#L52

Link: https://lkml.kernel.org/r/20230831193827.1528867-1-azeemshaikh38@gmail.com
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 328970de Thu May 23 16:04:05 MDT 2019 Thomas Gleixner <tglx@linutronix.de> treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 145

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 021110 1307 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 84 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190524100844.756442981@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff a9a08845 Sun Feb 11 15:34:03 MST 2018 Linus Torvalds <torvalds@linux-foundation.org> vfs: do bulk POLL* -> EPOLL* replacement

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2b0143b5 Tue Mar 17 16:25:59 MDT 2015 David Howells <dhowells@redhat.com> VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff ff8fb335 Tue Jan 21 16:48:20 MST 2014 Goldwyn Rodrigues <rgoldwyn@suse.de> ocfs2: remove versioning information

The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 2b15ad06 Sun Jul 24 11:43:19 MDT 2011 Al Viro <viro@zeniv.linux.org.uk> dlmfs: use inode_init_owner()

don't open-code it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 0016eedc Sat Jan 30 05:33:50 MST 2010 Joel Becker <joel.becker@oracle.com> ocfs2_dlmfs: Use the stackglue.

Rather than directly using o2dlm, dlmfs can now use the stackglue. This
allows it to use userspace cluster stacks and fs/dlm. This commit
forces o2cb for now. A latter commit will bump the protocol version and
allow non-o2cb stacks.

This is one big sed, really. LKM_xxMODE becomes DLM_LOCK_xx. LKM_flag
becomes DLM_LKF_flag.

We also learn to check that the LVB is valid before reading it. Any DLM
can lose the contents of the LVB during a complicated recovery. userdlm
should be checking this. Now it does. dlmfs will return 0 from read(2)
if the LVB was invalid.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
/linux-master/fs/hugetlbfs/
H A Dinode.cdiff ae62bcb5 Mon Oct 23 08:30:49 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: report f_fsid from s_dev for "simple" filesystems

There are many "simple" filesystems (*) that report null f_fsid in
statfs(2). Those "simple" filesystems report sb->s_dev as the st_dev
field of the stat syscalls for all inodes of the filesystem (**).

In order to enable fanotify reporting of events with fsid on those
"simple" filesystems, report the sb->s_dev number in f_fsid field of
statfs(2).

(*) For most of the "simple" filesystem refered to in this commit, the
->statfs() operation is simple_statfs(). Some of those fs assign the
simple_statfs() method directly in their ->s_op struct and some assign it
indirectly via a call to simple_fill_super() or to pseudo_fs_fill_super()
with either custom or "simple" s_op.
We also make the same change to efivarfs and hugetlbfs, although they do
not use simple_statfs(), because they use the simple_* inode opreations
(e.g. simple_lookup()).

(**) For most of the "simple" filesystems, the ->getattr() method is not
assigned, so stat() is implemented by generic_fillattr(). A few "simple"
filesystem use the simple_getattr() method which also calls
generic_fillattr() to fill most of the stat struct.

The two exceptions are procfs and 9p. procfs implements several different
->getattr() methods, but they all end up calling generic_fillattr() to
fill the st_dev field from sb->s_dev.

9p has more complicated ->getattr() methods, but they too, end up calling
generic_fillattr() to fill the st_dev field from sb->s_dev.

Note that 9p and kernfs also call simple_statfs() from custom ->statfs()
methods which already fill the f_fsid field, but v9fs_statfs() calls
simple_statfs() only in case f_fsid was not filled and kenrfs_statfs()
overwrites f_fsid after calling simple_statfs().

Link: https://lore.kernel.org/r/20230919094820.g5bwharbmy2dq46w@quack3/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023143049.2944970-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff ae62bcb5 Mon Oct 23 08:30:49 MDT 2023 Amir Goldstein <amir73il@gmail.com> fs: report f_fsid from s_dev for "simple" filesystems

There are many "simple" filesystems (*) that report null f_fsid in
statfs(2). Those "simple" filesystems report sb->s_dev as the st_dev
field of the stat syscalls for all inodes of the filesystem (**).

In order to enable fanotify reporting of events with fsid on those
"simple" filesystems, report the sb->s_dev number in f_fsid field of
statfs(2).

(*) For most of the "simple" filesystem refered to in this commit, the
->statfs() operation is simple_statfs(). Some of those fs assign the
simple_statfs() method directly in their ->s_op struct and some assign it
indirectly via a call to simple_fill_super() or to pseudo_fs_fill_super()
with either custom or "simple" s_op.
We also make the same change to efivarfs and hugetlbfs, although they do
not use simple_statfs(), because they use the simple_* inode opreations
(e.g. simple_lookup()).

(**) For most of the "simple" filesystems, the ->getattr() method is not
assigned, so stat() is implemented by generic_fillattr(). A few "simple"
filesystem use the simple_getattr() method which also calls
generic_fillattr() to fill most of the stat struct.

The two exceptions are procfs and 9p. procfs implements several different
->getattr() methods, but they all end up calling generic_fillattr() to
fill the st_dev field from sb->s_dev.

9p has more complicated ->getattr() methods, but they too, end up calling
generic_fillattr() to fill the st_dev field from sb->s_dev.

Note that 9p and kernfs also call simple_statfs() from custom ->statfs()
methods which already fill the f_fsid field, but v9fs_statfs() calls
simple_statfs() only in case f_fsid was not filled and kenrfs_statfs()
overwrites f_fsid after calling simple_statfs().

Link: https://lore.kernel.org/r/20230919094820.g5bwharbmy2dq46w@quack3/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023143049.2944970-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a08c7193 Tue Sep 26 01:20:17 MDT 2023 Sidhartha Kumar <sidhartha.kumar@oracle.com> mm/filemap: remove hugetlb special casing in filemap.c

Remove special cased hugetlb handling code within the page cache by
changing the granularity of ->index to the base page size rather than the
huge page size. The motivation of this patch is to reduce complexity
within the filemap code while also increasing performance by removing
branches that are evaluated on every page cache lookup.

To support the change in index, new wrappers for hugetlb page cache
interactions are added. These wrappers perform the conversion to a linear
index which is now expected by the page cache for huge pages.

========================= PERFORMANCE ======================================

Perf was used to check the performance differences after the patch.
Overall the performance is similar to mainline with a very small larger
overhead that occurs in __filemap_add_folio() and
hugetlb_add_to_page_cache(). This is because of the larger overhead that
occurs in xa_load() and xa_store() as the xarray is now using more entries
to store hugetlb folios in the page cache.

Timing

aarch64
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
real 1m49.568s
user 0m0.000s
sys 1m49.461s

6.5-rc3:
[root]# time fallocate -l 700GB test.txt
real 1m47.495s
user 0m0.000s
sys 1m47.370s
1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m47.024s
user 0m0.000s
sys 1m46.921s

6.5-rc3:
[root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
real 1m44.551s
user 0m0.000s
sys 1m44.438s

x86
2MB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
real 0m22.383s
user 0m0.000s
sys 0m22.255s

6.5-rc3:
[opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
real 0m22.735s
user 0m0.038s
sys 0m22.567s

1GB Page Size
6.5-rc3 + this patch:
[root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
real 0m25.786s
user 0m0.001s
sys 0m25.589s

6.5-rc3:
[root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
real 0m33.454s
user 0m0.001s
sys 0m33.193s

aarch64:
workload - fallocate a 700GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--95.04%--__pi_clear_page
|
|--3.57%--clear_huge_page
| |
| |--2.63%--rcu_all_qs
| |
| --0.91%--__cond_resched
|
--0.67%--__cond_resched
0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache
0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
--100.00%--__arm64_sys_fallocate
ksys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
|--94.91%--__pi_clear_page
|
|--4.11%--clear_huge_page
| |
| |--3.00%--rcu_all_qs
| |
| --1.10%--__cond_resched
|
--0.59%--__cond_resched
0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio

x86
workload - fallocate a 100GB file backed by huge pages

6.5-rc3 + this patch:
2MB Page Size:
hugetlbfs_fallocate
|
--99.57%--clear_huge_page
|
--98.47%--clear_page_erms
|
--0.53%--asm_sysvec_apic_timer_interrupt

0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio
0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store

6.5-rc3
2MB Page Size:
--99.93%--__x64_sys_fallocate
vfs_fallocate
hugetlbfs_fallocate
|
--99.38%--clear_huge_page
|
|--98.40%--clear_page_erms
|
--0.59%--__cond_resched
0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio

========================= TESTING ======================================

This patch passes libhugetlbfs tests and LTP hugetlb tests

********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 110 113
* Skipped: 0 0
* PASS: 107 113
* FAIL: 0 0
* Killed by signal: 3 0
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********

Done executing testcases.
LTP Version: 20220527-178-g2761a81c4

page migration was also tested using Mike Kravetz's test program.[8]

[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/linux-master/fs/exfat/
H A Dfile.cdiff 0d72b928 Mon Aug 07 13:38:33 MDT 2023 Jeff Layton <jlayton@kernel.org> fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
diff 2cb1e089 Mon May 22 07:50:15 MDT 2023 David Howells <dhowells@redhat.com> splice: Use filemap_splice_read() instead of generic_file_splice_read()

Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 2f221d6f Thu Jan 21 06:19:26 MST 2021 Christian Brauner <christian.brauner@ubuntu.com> attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
diff 81df1ad4 Mon Apr 20 20:13:10 MDT 2020 Eric Sandeen <sandeen@sandeen.net> exfat: truncate atimes to 2s granularity

The timestamp for access_time has double seconds granularity(There is no
10msIncrement field for access_time unlike create/modify_time).
exfat's atimes are restricted to only 2s granularity so after
we set an atime, round it down to the nearest 2s and set the
sub-second component of the timestamp to 0.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
diff 81df1ad4 Mon Apr 20 20:13:10 MDT 2020 Eric Sandeen <sandeen@sandeen.net> exfat: truncate atimes to 2s granularity

The timestamp for access_time has double seconds granularity(There is no
10msIncrement field for access_time unlike create/modify_time).
exfat's atimes are restricted to only 2s granularity so after
we set an atime, round it down to the nearest 2s and set the
sub-second component of the timestamp to 0.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
diff 81df1ad4 Mon Apr 20 20:13:10 MDT 2020 Eric Sandeen <sandeen@sandeen.net> exfat: truncate atimes to 2s granularity

The timestamp for access_time has double seconds granularity(There is no
10msIncrement field for access_time unlike create/modify_time).
exfat's atimes are restricted to only 2s granularity so after
we set an atime, round it down to the nearest 2s and set the
sub-second component of the timestamp to 0.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>

Completed in 2384 milliseconds

123