History log of /linux-master/include/linux/fs.h
Revision Date Author Comments
# addebd9a 12-Aug-2022 Keith Busch <kbusch@kernel.org>

fs: don't randomize struct kiocb fields

This is a size sensitive structure and randomizing can introduce extra
padding that breaks io_uring's fixed size expectations. There are few
fields here as it is, half of which need a fixed order to optimally
pack, so the randomization isn't providing much.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/io-uring/b6f508ca-b1b2-5f40-7998-e4cff1cf7212@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 2b3416ce 14-Jul-2022 Yang Xu <xuyang2018.jy@fujitsu.com>

fs: add mode_strip_sgid() helper

Add a dedicated helper to handle the setgid bit when creating a new file
in a setgid directory. This is a preparatory patch for moving setgid
stripping into the vfs. The patch contains no functional changes.

Currently the setgid stripping logic is open-coded directly in
inode_init_owner() and the individual filesystems are responsible for
handling setgid inheritance. Since this has proven to be brittle as
evidenced by old issues we uncovered over the last months (see [1] to
[3] below) we will try to move this logic into the vfs.

Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Link: https://lore.kernel.org/r/1657779088-2242-1-git-send-email-xuyang2018.jy@fujitsu.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# 04b94071 27-Jul-2022 Daniil Lunev <dlunev@chromium.org>

vfs: function to prevent re-use of block-device-based superblocks

The function is to be called from filesystem-specific code to mark a
superblock to be ignored by superblock test and thus never re-used. The
function also unregisters bdi if the bdi is per-superblock to avoid
collision if a new superblock is created to represent the filesystem.
generic_shutdown_super() skips unregistering bdi for a retired superlock as
it assumes retire function has already done it.

This patch adds the functionality only for the block-device-based supers,
since the primary use case of the feature is to gracefully handle force
unmount of external devices, mounted with FUSE. This can be further
extended to cover all superblocks, if the need arises.

Signed-off-by: Daniil Lunev <dlunev@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 6f7db389 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

fsdax: dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar with vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.

Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

# 9d0ddc0c 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops->migratepage()

With all users converted to migrate_folio(), remove this operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 67235182 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()

Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 5490da4f 06-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add aops->migrate_folio

Provide a folio-based replacement for aops->migratepage. Update the
documentation to document migrate_folio instead of migratepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 68f2736a 07-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Convert all PageMovable users to movable_operations

These drivers are rather uncomfortably hammered into the
address_space_operations hole. They aren't filesystems and don't behave
like filesystems. They just need their own movable_operations structure,
which we can point to directly from page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

# 868941b1 29-Jun-2022 Jason A. Donenfeld <Jason@zx2c4.com>

fs: remove no_llseek

Now that all callers of ->llseek are going through vfs_llseek(), we
don't gain anything by keeping no_llseek around. Nothing actually calls
it and setting ->llseek to no_lseek is completely equivalent to
leaving it NULL.

Longer term (== by the end of merge window) we want to remove all such
intializations. To simplify the merge window this commit does *not*
touch initializers - it only defines no_llseek as NULL (and simplifies
the tests on file opening).

At -rc1 we'll need do a mechanical removal of no_llseek -

git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
would do it.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 164f4064 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

keep iocb_flags() result cached in struct file

* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal
opens, alloc_file() for pipe()/socket()/etc.)
* update when handling F_SETFL
* keep in a new field - file->f_iocb_flags; since that thing is needed only
before the refcount reaches zero, we can put it into the same anon union
where ->f_rcuhead and ->f_llist live - those are used only after refcount
reaches zero.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 91b94c5d 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC

New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# e87f2c26 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

struct file: use anonymous union member for rcuhead and llist

Once upon a time we couldn't afford anon unions; these days minimal
gcc version had been raised enough to take care of that.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 66fa3ced 23-Jun-2022 Stefan Roesch <shr@fb.com>

fs: Add async write file modification handling.

This adds a file_modified_async() function to return -EAGAIN if the
request either requires to remove privileges or needs to update the file
modification time. This is required for async buffered writes, so the
request gets handled in the io worker of io-uring.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220623175157.1715274-11-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 80175539 23-Jun-2022 Stefan Roesch <shr@fb.com>

fs: add a FMODE_BUF_WASYNC flags for f_mode

This introduces the flag FMODE_BUF_WASYNC. If devices support async
buffered writes, this flag can be set. It also modifies the check in
generic_write_checks to take async buffered writes into consideration.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20220623175157.1715274-8-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 9adf24a4 28-Jun-2022 Christian Brauner <brauner@kernel.org>

fs: port HAS_UNMAPPED_ID() to vfs{g,u}id_t

The HAS_UNMAPPED_ID() helper is fully self contained so we can port it
to vfs{g,u}id_t without much effort.

Cc: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# b27c82e1 21-Jun-2022 Christian Brauner <brauner@kernel.org>

attr: port attribute changes to new types

Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.

This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.

Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.

The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.

We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.

Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.

The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.

Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# 1f36146a 21-Jun-2022 Christian Brauner <brauner@kernel.org>

fs: introduce tiny iattr ownership update helpers

Nearly all fileystems currently open-code the same checks for
determining whether the i_{g,u}id fields of an inode need to be updated
and then updating the fields.

Introduce tiny helpers i_{g,u}id_needs_update() and i_{g,u}id_update()
that wrap this logic. This allows filesystems to not care about updating
inode->i_{g,u}id with the correct values themselves instead leaving this
to the helpers.

We also get rid of a lot of code duplication and make it easier to
change struct iattr in the future since changes can be localized to
these helpers.

And finally we make it hard to conflate k{g,u}id_t types with
vfs{g,u}id_t types for filesystems that support idmapped mounts.

In the following patch we will port all filesystems that raise
FS_ALLOW_IDMAP to use the new helpers. However, the ultimate goal is to
convert all filesystems to make use of these helpers.

All new helpers are nops on non-idmapped mounts.

Link: https://lore.kernel.org/r/20220621141454.2914719-5-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# 45c31150 21-Jun-2022 Christian Brauner <brauner@kernel.org>

fs: use mount types in iattr

Add ia_vfs{g,u}id members of type vfs{g,u}id_t to struct iattr. We use
an anonymous union (similar to what we do in struct file) around
ia_{g,u}id and ia_vfs{g,u}id.

At the end of this series ia_{g,u}id and ia_vfs{g,u}id will always
contain the same value independent of whether struct iattr is
initialized from an idmapped mount. This is a change from how this is
done today.

Wrapping this in a anonymous unions has a few advantages. It allows us
to avoid needlessly increasing struct iattr. Since the types for
ia_{g,u}id and ia_vfs{g,u}id are structures with overlapping/identical
members they are covered by of the C standard and it is safe
to initialize and access them.

Filesystems that raise FS_ALLOW_IDMAP and thus support idmapped mounts
will have to use ia_vfs{g,u}id and the associated helpers. And will be
ported at the end of this series. They will immediately benefit from the
type safe new helpers.

Filesystems that do not support FS_ALLOW_IDMAP can continue to use
ia_{g,u}id for now. The aim is to convert every filesystem to always use
ia_vfs{g,u}id and thus ultimately remove the ia_{g,u}id members.

Link: https://lore.kernel.org/r/20220621141454.2914719-4-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# 234a3113 21-Jun-2022 Christian Brauner <brauner@kernel.org>

fs: add two type safe mapping helpers

Introduce i_{g,u}id_into_vfs{g,u}id(). They return vfs{g,u}id_t. This
makes it way harder to confused idmapped mount {g,u}ids with filesystem

The two helpers will eventually replace the old non type safe
i_{g,u}id_into_mnt() helpers once we finished converting all places. Add
a comment noting that they will be removed in the future.

All new helpers are nops on non-idmapped mounts.

Link: https://lore.kernel.org/r/20220621141454.2914719-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

# 70f8d9c5 02-Mar-2022 Al Viro <viro@zeniv.linux.org.uk>

move mount-related externs from fs.h to mount.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 81132a39 01-Nov-2021 Gou Hao <gouhao@uniontech.com>

fs: remove fget_many and fput_many interface

These two interface were added in 091141a42 commit,
but now there is no place to call them.

The only user of fput/fget_many() was removed in commit
62906e89e63b ("io_uring: remove file batch-get optimisation").

A user of get_file_rcu_many() were removed in commit
f073531070d2 ("init: add an init_dup helper").

And replace atomic_long_sub/add to atomic_long_dec/inc
can improve performance.

Here are the test results of unixbench:

Cmd: ./Run -c 64 context1

Without patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 2798407.0 6996.0
System Benchmarks Index Score (Partial Only) 6996.0

With patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
Pipe-based Context Switching 4000.0 3486268.8 8715.7
System Benchmarks Index Score (Partial Only) 8715.7

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# fb70bf12 30-Mar-2022 Chuck Lever <chuck.lever@oracle.com>

NFSD: Instantiate a struct file when creating a regular NFSv4 file

There have been reports of races that cause NFSv4 OPEN(CREATE) to
return an error even though the requested file was created. NFSv4
does not provide a status code for this case.

To mitigate some of these problems, reorganize the NFSv4
OPEN(CREATE) logic to allocate resources before the file is actually
created, and open the new file while the parent directory is still

Two new APIs are added:

+ Add an API that works like nfsd_file_acquire() but does not open
the underlying file. The OPEN(CREATE) path can use this API when it
already has an open file.

+ Add an API that is kin to dentry_open(). NFSD needs to create a
file and grab an open "struct file *" atomically. The
alloc_empty_file() has to be done before the inode create. If it
fails (for example, because the NFS server has exceeded its
max_files limit), we avoid creating the file and can still return
an error to the NFS client.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=382
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: JianHong Yin <jiyin@redhat.com>

# 6d4675e6 19-May-2022 Minchan Kim <minchan@kernel.org>

mm: don't be stuck to rmap lock on reclaim path

The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap). It makes reclaim path stuck. In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.

This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.

Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).

In a heavy app workload in Android, trace shows following statistics. It
almost removes rmap lock contention from reclaim path.

Martin Liu reported:


max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read
601 0 601 145.544681 28817 198 rmap_walk_file


max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
NaN NaN NaN NaN NaN 0.0 NaN
0 0 0 0.127645 1 12 rmap_walk_file

[minchan@kernel.org: add comment, per Matthew]
Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

# 2443da22 02-May-2022 Dai Ngo <dai.ngo@oracle.com>

fs/lock: add 2 callbacks to lock_manager_operations to resolve conflict

Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to
lock_manager_operations to allow the lock manager to take appropriate
action to resolve the lock conflict if possible.

A new field, lm_mod_owner, is also added to lock_manager_operations.
The lm_mod_owner is used by the fs/lock code to make sure the lock
manager module such as nfsd, is not freed while lock conflict is being

lm_lock_expirable checks and returns true to indicate that the lock
conflict can be resolved else return false. This callback must be
called with the flc_lock held so it can not block.

lm_expire_lock is called to resolve the lock conflict if the returned
value from lm_lock_expirable is true. This callback is called without
the flc_lock held since it's allowed to block. Upon returning from
this callback, the lock conflict should be resolved and the caller is
expected to restart the conflict check from the beginnning of the list.

Lock manager, such as NFSv4 courteous server, uses this callback to
resolve conflict by destroying lock owner, or the NFSv4 courtesy client
(client that has expired but allowed to maintains its states) that owns
the lock.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

# 591502c5 02-May-2022 Dai Ngo <dai.ngo@oracle.com>

fs/lock: add helper locks_owner_has_blockers to check for blockers

Add helper locks_owner_has_blockers to check if there is any blockers
for a given lockowner.

Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

# a2ad63da 09-May-2022 NeilBrown <neilb@suse.de>

VFS: add FMODE_CAN_ODIRECT file flag

Currently various places test if direct IO is possible on a file by
checking for the existence of the direct_IO address space operation.
This is a poor choice, as the direct_IO operation may not be used - it is
only used if the generic_file_*_iter functions are called for direct IO
and some filesystems - particularly NFS - don't do this.

Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
various places to check this (avoiding pointer dereferences).
do_dentry_open() will set this flag if ->direct_IO is present, so
filesystems do not need to be changed.

NFS *is* changed, to set the flag explicitly and discard the direct_IO
entry in the address_space_operations for files.

Other filesystems which currently use noop_direct_IO could usefully be
changed to set this flag instead.

Link: https://lkml.kernel.org/r/164859778128.29473.15189737957277399416.stgit@noble.brown
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

# e1209d3a 09-May-2022 NeilBrown <neilb@suse.de>

mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space

swap currently uses ->readpage to read swap pages. This can only request
one page at a time from the filesystem, which is not most efficient.

swap uses ->direct_IO for writes which while this is adequate is an
inappropriate over-loading. ->direct_IO may need to had handle allocate
space for holes or other details that are not relevant for swap.

So this patch introduces a new address_space operation: ->swap_rw. In
this patch it is used for reads, and a subsequent patch will switch writes
to use it.

No filesystem yet supports ->swap_rw, but that is not a problem because
no filesystem actually works with filesystem-based swap.
Only two filesystems set SWP_FS_OPS:
- cifs sets the flag, but ->direct_IO always fails so swap cannot work.
- nfs sets the flag, but ->direct_IO calls generic_write_checks()
which has failed on swap files for several releases.

To ensure that a NULL ->swap_rw isn't called, ->activate_swap() for both
NFS and cifs are changed to fail if ->swap_rw is not set. This can be
removed if/when the function is added.

Future patches will restore swap-over-NFS functionality.

To submit an async read with ->swap_rw() we need to allocate a structure
to hold the kiocb and other details. swap_readpage() cannot handle
transient failure, so we create a mempool to provide the structures.

Link: https://lkml.kernel.org/r/164859778125.29473.13430559328221330589.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

# 7f8d12ea 29-Mar-2022 Naohiro Aota <naohiro.aota@wdc.com>

fs: add a lockdep check function for sb_start_write()

Add a function sb_write_started() to allow callers to verify if
sb_start_write() is properly called. It will be used for assertion in

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 8560cb1a 04-May-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops->freepage

All implementations now use free_folio so we can delete the callers
and the method.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

# d2329aa0 01-May-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add free_folio address space operation

Include documentation and convert the callers to use ->free_folio as
well as ->freepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

# 704ead2b 30-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove last vestiges of releasepage

All users are now converted to release_folio

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

# fa29000b 29-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add aops->release_folio

This replaces aops->releasepage. Update the documentation, and call it
if it exists.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

# 7e0a1265 29-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm,fs: Remove aops->readpage

With all implementations of aops->readpage converted to aops->read_folio,
we can stop checking whether it's set and remove the member from aops.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

# 5efe7448 29-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Introduce aops->read_folio

Change all the callers of ->readpage to call ->read_folio in preference,
if it exists. This is a transitional duplication, and will be removed
by the end of the series.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

# 520f301c 17-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Convert is_dirty_writeback() to take a folio

Pass a folio instead of a page to aops->is_dirty_writeback().
Convert both implementations and the caller.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 84a1041c 03-Mar-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove pagecache_write_begin() and pagecache_write_end()

These wrappers have no more users; remove them.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 9d6b0cd7 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove flags parameter from aops->write_begin

There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 236d93c4 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove AOP_FLAG_NOFS

With all users of this flag gone, we can stop testing whether it's set.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 56f5746c 22-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

namei: Merge page_symlink() and __page_symlink()

There are no callers of __page_symlink() left, so we can remove that
entry point.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>

# ee692a21 10-May-2022 Jens Axboe <axboe@kernel.dk>

fs,io_uring: add infrastructure for uring-cmd

file_operations->uring_cmd is a file private handler.
This is somewhat similar to ioctl but hopefully a lot more sane and
useful as it can be used to enable many io_uring capabilities for the
underlying operation.

IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type, it's for the provider of ->uring_cmd()
to deal with.

Co-developed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# d7414ba1 20-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

filemap: Remove AOP_FLAG_CONT_EXPAND

This flag is no longer used, so remove it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

# 800ba295 19-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Pass an iocb to generic_perform_write()

We can extract both the file pointer and the pos from the iocb.
This simplifies each caller as well as allowing generic_perform_write()
to see more of the iocb contents in the future.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

# b2403a61 14-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs, net: Move read_descriptor_t to net.h

fs.h has no more need for this typedef; networking is now the sole user
of the read_descriptor_t.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

# a9fcd89d 14-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove read_actor_t

This typedef is not used any more.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

# 704528d8 23-Mar-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove ->readpages address space operation

All filesystems have now been converted to use ->readahead, so
remove the ->readpages operation and fix all the comments that
used to refer to it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

# 7b12e496 07-Mar-2022 Christoph Hellwig <hch@lst.de>

fs: remove fs.f_write_hint

The value is now completely unused except for reporting it back through
the F_GET_FILE_RW_HINT ioctl, so remove the value and the two ioctls
for it.

Trying to use the F_SET_FILE_RW_HINT and F_GET_FILE_RW_HINT fcntls will
now return EINVAL, just like it would on a kernel that never supported
this functionality in the first place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 41d36a9f 07-Mar-2022 Christoph Hellwig <hch@lst.de>

fs: remove kiocb.ki_hint

This field is entirely unused now except for a tracepoint in f2fs, so
remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 8b9f3ac5 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

fs: introduce alloc_inode_sb() to allocate filesystems specific inode

The allocated inode cache is supposed to be added to its memcg list_lru
which should be allocated as well in advance. That can be done by
kmem_cache_alloc_lru() which allocates object and list_lru. The file
systems is main user of it. So introduce alloc_inode_sb() to allocate
file system specific inodes and set up the inode reclaim context
properly. The file system is supposed to use alloc_inode_sb() to
allocate inodes.

In later patches, we will convert all users to the new API.

Link: https://lkml.kernel.org/r/20220228122126.37293-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a128b054 22-Mar-2022 Anthony Iliopoulos <ailiop@suse.com>

mount: warn only once about timestamp range expiration

Commit f8b92ba67c5d ("mount: Add mount warning for impending timestamp
expiry") introduced a mount warning regarding filesystem timestamp
limits, that is printed upon each writable mount or remount.

This can result in a lot of unnecessary messages in the kernel log in
setups where filesystems are being frequently remounted (or mounted
multiple times).

Avoid this by setting a superblock flag which indicates that the warning
has been emitted at least once for any particular mount, as suggested in

Link: https://lore.kernel.org/CAHk-=wim6VGnxQmjfK_tDg6fbHYKL4EFkmnTjVr9QnRqjDBAeA@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20220119202934.26495-1-ailiop@suse.com
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 84dacdbd 22-Mar-2022 NeilBrown <neilb@suse.de>

mm: document and polish read-ahead code

Add some "big-picture" documentation for read-ahead and polish the code
to make it fit this documentation.

The meaning of ->async_size is clarified to match its name. i.e. Any
request to ->readahead() has a sync part and an async part. The caller
will wait for the sync pages to complete, but will not wait for the
async pages. The first async page is still marked PG_readahead

Note that the current function names page_cache_sync_ra() and
page_cache_async_ra() are misleading. All ra request are partly sync
and partly async, so either part can be empty. A page_cache_sync_ra()
request will usually set ->async_size non-zero, implying it is not all

When a non-zero req_count is passed to page_cache_async_ra(), the
implication is that some prefix of the request is synchronous, though
the calculation made there is incorrect - I haven't tried to fix it.

Link: https://lkml.kernel.org/r/164549983734.9187.11586890887006601405.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# cbcc268b 13-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Move many prototypes to pagemap.h

These functions are page cache functionality and don't need to be
declared in fs.h.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

# c56109dd 13-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/truncate: Combine invalidate_mapping_pagevec() and __invalidate_mapping_pages()

We can save a function call by combining these two functions, which
are identical except for the return value. Also move the prototype
to mm/internal.h.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

# 3a3bae50 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops ->set_page_dirty

With all implementations converted to ->dirty_folio, we can stop calling
this fallback method and remove it entirely.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# 6f31a5a2 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add aops->dirty_folio

This replaces ->set_page_dirty(). It returns a bool instead of an int
and takes the address_space as a parameter instead of expecting the
implementations to retrieve the address_space from the page. This is
particularly important for filesystems which use FS_OPS for swap.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# 072acba6 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops->launder_page

With all users converted to ->launder_folio, remove ->launder_page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# affa80e8 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add aops->launder_folio

Since the only difference between ->launder_page and ->launder_folio
is the type of the pointer, these can safely use a union without
affecting bisectability.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# f50015a5 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove aops->invalidatepage

With all users migrated to ->invalidate_folio, remove the old operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# 5660a863 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Remove noop_invalidatepage()

We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# 128d1f82 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Add invalidate_folio() aops method

This is used in preference to invalidatepage, if defined.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# 2e7e80f7 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Convert is_partially_uptodate to folios

Since the uptodate property is maintained on a per-folio basis, the
is_partially_uptodate method should also take a folio. Fix the types
at the same time so it's clear that it returns true/false and takes
the count in bytes, not blocks.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs

# f6f7a25a 12-Aug-2021 Omar Sandoval <osandov@fb.com>

fs: export variant of generic_write_checks without iov_iter

Encoded I/O in Btrfs needs to check a write with a given logical size
without an iov_iter that matches that size (because the iov_iter we have
is for the compressed data). So, factor out the parts of
generic_write_check() that don't need an iov_iter into a new
generic_write_checks_count() function and export that.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 87112933 04-Sep-2019 Omar Sandoval <osandov@fb.com>

fs: export rw_verify_area()

I'm adding btrfs ioctls to read and write compressed data, and rather
than duplicating the checks in rw_verify_area(), let's just export it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 5298d4bf 17-Jan-2022 Christoph Hellwig <hch@lst.de>

unicode: clean up the Kconfig symbol confusion

Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d04787012 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

# 0a4ee518 21-Jan-2022 Christoph Hellwig <hch@lst.de>

mm: remove cleancache

Patch series "remove Xen tmem leftovers".

Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.

This patch (of 13):

The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dcd0 ("xen: remove tmem driver").

[akpm@linux-foundation.org: remove now-unreachable code]

Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 9c011be1 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move namei sysctls to its own file

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move namei's own sysctl knobs to its own file.

Other than the move we also avoid initializing two static variables to 0
as this is not needed:

* sysctl_protected_symlinks
* sysctl_protected_hardlinks

Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# dd81faa8 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move locking sysctls where they are used

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

The locking fs sysctls are only used on fs/locks.c, so move them there.

Link: https://lkml.kernel.org/r/20211129205548.605569-7-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c8c0c239 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move dcache sysctls to its own file

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the dcache sysctl clutter out of kernel/sysctl.c. This is a
small one-off entry, perhaps later we can simplify this representation,
but for now we use the helpers we have. We won't know how we can
simplify this further untl we're fully done with the cleanup.

[arnd@arndb.de: avoid unused-function warning]
Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 204d5a24 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move fs stat sysctls to file_table.c

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

We can create the sysctl dynamically on early init for fs stat to help
with this clutter. This dusts off the fs stat syctls knobs and puts
them into where they are declared.

Link: https://lkml.kernel.org/r/20211129205548.605569-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 1d67fe58 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move inode sysctls to its own file

Patch series "sysctl: 4th set of kernel/sysctl cleanups".

This is slimming down the fs uses of kernel/sysctl.c to the point that
the next step is to just get rid of the fs base directory for it and
move that elsehwere, so that next patch series starts dealing with that
to demo how we can end up cleaning up a full base directory from
kernel/sysctl.c, one at a time.

This patch (of 9):

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the inode sysctls to its own file. Since we are no longer using
this outside of fs/ remove the extern declaration of its respective proc

We use early_initcall() as it is the earliest we can use.

[arnd@arndb.de: avoid unused-variable warning]
Link: https://lkml.kernel.org/r/20211203190123.874239-1-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211129205548.605569-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211129205548.605569-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 3e9d80a8 14-Jan-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm,fs: split dump_mapping() out from dump_page()

dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
able to call it when we don't have a struct page. Split it out and move
it to fs/inode.c. Take the opportunity to simplify some of the debug
messages a little.

Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0ea9fc15 23-Nov-2021 Arnd Bergmann <arnd@arndb.de>

fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes

My patch to rework oabi fcntl64() introduced a harmless
sparse warning when file locking is disabled:

arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: sparse: incorrect type in argument 3 (different address spaces) @@ expected struct flock64 [noderef] __user *user @@ got struct flock64 * @@
arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: expected struct flock64 [noderef] __user *user
arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: got struct flock64 *
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: sparse: incorrect type in argument 4 (different address spaces) @@ expected struct flock64 [noderef] __user *user @@ got struct flock64 * @@
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: expected struct flock64 [noderef] __user *user
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: got struct flock64 *

When file locking is enabled, everything works correctly and the
right data gets passed, but the stub declarations in linux/fs.h
did not get modified when the calling conventions changed in an
earlier patch.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 7e2d8c29ecdd ("ARM: 9111/1: oabi-compat: rework fcntl64() emulation")
Fixes: a75d30c77207 ("fs/locks: pass kernel struct flock to fcntl_getlk/setlk")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

# 1bd9c4e4 18-Nov-2021 David Howells <dhowells@redhat.com>

vfs, cachefiles: Mark a backing file in use with an inode flag

Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.

Alter rmdir to reject attempts to remove a directory marked with this flag.
This is used by cachefiles to prevent cachefilesd from removing them.

Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.

ver #3:
- Check for the object pointer being NULL in the tracepoints rather than
the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4

# 08276bda 20-Oct-2021 David Howells <dhowells@redhat.com>

vfs, fscache: Implement pinning of cache usage for writeback

Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.

Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.

To get around this, I added the following:

(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.

(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this

This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears

(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache

(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.

(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any

The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.

For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4

# 4bdcd1dd 28-Oct-2021 Jens Axboe <axboe@kernel.dk>

mm: move filemap_range_needs_writeback() into header

No functional changes in this patch, just in preparation for efficiently
calling this light function from the block O_DIRECT handling.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# bd303368 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: support mapped mounts of mapped filesystems

In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.

With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.

In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.

As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this

This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.

Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# a1ec9040 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: add i_user_ns() helper

Since we'll be passing the filesystem's idmapping in even more places in
the following patches and we do already dereference struct inode to get
to the filesystem's idmapping multiple times add a tiny helper.

Link: https://lore.kernel.org/r/20211123114227.3124056-10-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-10-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-10-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 209188ce 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: port higher-level mapping helpers

Enable the mapped_fs{g,u}id() helpers to support filesystems mounted
with an idmapping. Apart from core mapping helpers that use
mapped_fs{g,u}id() to initialize struct inode's i_{g,u}id fields xfs is
the only place that uses these low-level helpers directly.

The patch only extends the helpers to be able to take the filesystem
idmapping into account. Since we don't actually yet pass the
filesystem's idmapping in no functional changes happen. This will happen
in a final patch.

Link: https://lore.kernel.org/r/20211123114227.3124056-9-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-9-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 1ac2a410 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: account for filesystem mappings

Currently we only support idmapped mounts for filesystems mounted
without an idmapping. This was a conscious decision mentioned in
multiple places (cf. e.g. [1]).

As explained at length in [3] it is perfectly fine to extend support for
idmapped mounts to filesystem's mounted with an idmapping should the
need arise. The need has been there for some time now. Various container
projects in userspace need this to run unprivileged and nested
unprivileged containers (cf. [2]).

Before we can port any filesystem that is mountable with an idmapping to
support idmapped mounts we need to first extend the mapping helpers to
account for the filesystem's idmapping. This again, is explained at
length in our documentation at [3] but I'll give an overview here again.

Currently, the low-level mapping helpers implement the remapping
algorithms described in [3] in a simplified manner. Because we could
rely on the fact that all filesystems supporting idmapped mounts are
mounted without an idmapping the translation step from or into the
filesystem idmapping could be skipped.

In order to support idmapped mounts of filesystem's mountable with an
idmapping the translation step we were able to skip before cannot be
skipped anymore. A filesystem mounted with an idmapping is very likely
to not use an identity mapping and will instead use a non-identity
mapping. So the translation step from or into the filesystem's idmapping
in the remapping algorithm cannot be skipped for such filesystems. More
details with examples can be found in [3].

This patch adds a few new and prepares some already existing low-level
mapping helpers to perform the full translation algorithm explained in
[3]. The low-level helpers can be written in a way that they only
perform the additional translation step when the filesystem is indeed
mounted with an idmapping.

If the low-level helpers detect that they are not dealing with an
idmapped mount they can simply return the relevant k{g,u}id unchanged;
no remapping needs to be performed at all. The no_idmapping() helper
detects whether the shortcut can be used.

If the low-level helpers detected that they are dealing with an idmapped
mount but the underlying filesystem is mounted without an idmapping we
can rely on the previous shorcut and can continue to skip the
translation step from or into the filesystem's idmapping.

These checks guarantee that only the minimal amount of work is
performed. As before, if idmapped mounts aren't used the low-level
helpers are idempotent and no work is performed at all.

This patch adds the helpers mapped_k{g,u}id_fs() and
mapped_k{g,u}id_user(). Following patches will port all places to
replace the old k{g,u}id_into_mnt() and k{g,u}id_from_mnt() with these
two new helpers. After the conversion is done k{g,u}id_into_mnt() and
k{g,u}id_from_mnt() will be removed. This also concludes the renaming of
the mapping helpers we started in [4]. Now, all mapping helpers will
started with the "mapped_" prefix making everything nice and consistent.

The mapped_k{g,u}id_fs() helpers replace the k{g,u}id_into_mnt()
helpers. They are to be used when k{g,u}ids are to be mapped from the
vfs, e.g. from from struct inode's i_{g,u}id. Conversely, the
mapped_k{g,u}id_user() helpers replace the k{g,u}id_from_mnt() helpers.
They are to be used when k{g,u}ids are to be written to disk, e.g. when
entering from a system call to change ownership of a file.

This patch only introduces the helpers. It doesn't yet convert the
relevant places to account for filesystem mounted with an idmapping.

[1]: commit 2ca4dcc4909d ("fs/mount_setattr: tighten permission checks")
[2]: https://github.com/containers/podman/issues/10374
[3]: Documentations/filesystems/idmappings.rst
[4]: commit a65e58e791a1 ("fs: document and rename fsid helpers")

Link: https://lore.kernel.org/r/20211123114227.3124056-5-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-5-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-5-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 476860b3 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: tweak fsuidgid_has_mapping()

If the caller's fs{g,u}id aren't mapped in the mount's idmapping we can
return early and skip the check whether the mapped fs{g,u}id also have a
mapping in the filesystem's idmapping. If the fs{g,u}id aren't mapped in
the mount's idmapping they consequently can't be mapped in the
filesystem's idmapping. So there's no point in checking that.

Link: https://lore.kernel.org/r/20211123114227.3124056-4-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-4-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-4-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# a793d79e 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: move mapping helpers

The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# bb49e9e7 02-Dec-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: add is_idmapped_mnt() helper

Multiple places open-code the same check to determine whether a given
mount is idmapped. Introduce a simple helper function that can be used
instead. This allows us to get rid of the fragile open-coding. We will
later change the check that is used to determine whether a given mount
is idmapped. Introducing a helper allows us to do this in a single
place instead of doing it for multiple places.

Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# ff36da69 29-Aug-2021 Matthew Wilcox (Oracle) <willy@infradead.org>


Instead of setting a bit in the fs_flags to set a bit in the
address_space, set the bit in the address_space directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# 6429e463 28-Oct-2021 Lorenz Bauer <lmb@cloudflare.com>

libfs: Move shmem_exchange to simple_rename_exchange

Move shmem_exchange and make it available to other callers.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20211028094724.59043-2-lmb@cloudflare.com

# 51b8c1fe 08-Nov-2021 Johannes Weiner <hannes@cmpxchg.org>

vfs: keep inodes with page cache off the inode shrinker LRU

Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could put fill lowmem zones
before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.

The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")).

The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.

Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0b3ea092 05-Nov-2021 Christoph Hellwig <hch@lst.de>

fs: explicitly unregister per-superblock BDIs

Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
associated with them, and unregister it when the superblock is shut

Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# e60feb44 14-Oct-2021 Josef Bacik <josef@toxicpanda.com>

fs: export an inode_update_time helper

If you already have an inode and need to update the time on the inode
there is no way to do this properly. Export this helper to allow file
systems to update time on the inode so the appropriate handler is
called, either ->update_time or generic_update_time.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 6b19b766 21-Oct-2021 Jens Axboe <axboe@kernel.dk>

fs: get rid of the res2 iocb->ki_complete argument

The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 5a72e899 12-Oct-2021 Jens Axboe <axboe@kernel.dk>

block: add a struct io_comp_batch argument to fops->iopoll()

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# ef99b2d3 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: replace the spin argument to blk_iopoll with a flags argument

Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 6e7c1770 14-Jul-2021 Christoph Hellwig <hch@lst.de>

fs: simplify get_filesystem_list / get_all_fs_names

Just output the '\0' separate list of supported file systems for block
devices directly rather than going through a pointless round of string

Based on an earlier patch from Al Viro <viro@zeniv.linux.org.uk>.

Modified list_bdev_fs_names() and split_fs_names() to return number of
null terminted strings to caller. Callers now use that information to
loop through all the strings instead of relying on one extra null char
being present at the end.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 7bb698f0 28-Jul-2021 Andreas Gruenbacher <agruenba@redhat.com>

fs: Move notify_change permission checks into may_setattr

Move the permission checks in notify_change into a separate function to
make them available to filesystems.

When notify_change is called, the vfs performs those checks before
calling into iop->setattr. However, a filesystem like gfs2 can only
lock and revalidate the inode inside ->setattr, and it must then repeat
those checks to err on the safe side.

It would be nice to get rid of the double checking, but moving the
permission check into iop->setattr altogether isn't really an option.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 592ca09b 23-Apr-2021 David Hildenbrand <david@redhat.com>

fs: update documentation of get_write_access() and friends

As VM_DENYWRITE does no longer exists, let's spring-clean the
documentation of get_write_access() and friends.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

# 332f606b 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: enable RCU'd ->get_acl()

Overlayfs does not cache ACL's (to avoid double caching). Instead it just
calls the underlying filesystem's i_op->get_acl(), which will return the
cached value, if possible.

In rcu path walk, however, get_cached_acl_rcu() is employed to get the
value from the cache, which will fail on overlayfs resulting in dropping
out of rcu walk mode. This can result in a big performance hit in certain

Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which
indicates pass-through)

Reported-by: garyhuang <zjh.20052005@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 0cad6246 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: add rcu argument to ->get_acl() callback

Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 4f911138 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

fs: add generic helper for filling statx attribute flags

The immutable and append-only properties on an inode are published on
the inode's i_flags and enforced by the VFS.

Create a helper to fill the corresponding STATX_ATTR_ flags in the kstat
structure from the inode's i_flags.

Only orange was converted to use this helper.
Other filesystems could use it in the future.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# bb0a55bb 20-Aug-2021 J. Bruce Fields <bfields@redhat.com>

nfs: don't allow reexport reclaims

In the reexport case, nfsd is currently passing along locks with the
reclaim bit set. The client sends a new lock request, which is granted
if there's currently no conflict--even if it's possible a conflicting
lock could have been briefly held in the interim.

We don't currently have any way to safely grant reclaim, so for now
let's just deny them all.

I'm doing this by passing the reclaim bit to nfs and letting it fail the
call, with the idea that eventually the client might be able to do
something more forgiving here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

# 5662c967 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

fs: kill sync_inode

Now that all users of sync_inode() have been deleted, remove

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 5a798493 14-Jul-2021 Josef Bacik <josef@toxicpanda.com>

fs: add a filemap_fdatawrite_wbc helper

Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
order to reclaim metadata reservations. Unfortunately most helpers in
this area are too smart for us:

1) The normal filemap_fdata* helpers only take range and sync modes, and
don't give any indication of how much was written, so we can only
flush full inodes, which isn't what we want in most cases.
2) The normal writeback path requires us to have the s_umount sem held,
but we can't unconditionally take it in this path because we could
3) The normal writeback path also skips inodes with I_SYNC set if we
write with WB_SYNC_NONE. This isn't the behavior we want under heavy
ENOSPC pressure, we want to actually make sure the pages are under
writeback before returning, and if another thread is in the middle of
writing the file we may return before they're under writeback and
miss our ordered extents and not properly wait for completion.
4) sync_inode() uses the normal writeback path and has the same problem
as #3.

What we really want is to call do_writepages() with our wbc. This way
we can make sure that writeback is actually started on the pages, and we
can control how many pages are written as a whole as we write many
inodes using the same wbc. Accomplish this with a new helper that does
just that so we can use it for our ENOSPC flushing infrastructure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 158ee7b6 24-Aug-2021 Christoph Hellwig <hch@lst.de>

block: mark blkdev_fsync static

blkdev_fsync is only used inside of block_dev.c since the
removal of the raw drіver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210824151823.1575100-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 6c7ef543 10-Aug-2021 Jens Axboe <axboe@kernel.dk>

fs: add kiocb alloc cache flag

If this kiocb can safely use the polled bio allocation cache, then this
flag must be set. Generally this can be set for polled IO, where we will
not see IRQ completions of the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 8228e2c3 08-Jul-2021 Dmitry Kadashev <dkadashev@gmail.com>

namei: add getname_uflags()

There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future. Let's just add a simple helper getname_uflags() that handles
this directly and use it.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# f7e33bdb 19-Aug-2021 Jeff Layton <jlayton@kernel.org>

fs: remove mandatory file locking support

We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>

# ec44610f 10-Aug-2021 Amir Goldstein <amir73il@gmail.com>

fsnotify: count all objects with attached connectors

Rename s_fsnotify_inode_refs to s_fsnotify_connectors and count all
objects with attached connectors, not only inodes with attached

This will be used to optimize fsnotify() calls on sb without any
type of marks.

Link: https://lore.kernel.org/r/20210810151220.285179-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Matthew Bobrowski <repnop@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 7506ae6a 24-May-2021 Jan Kara <jack@suse.cz>

mm: Add functions to lock invalidate_lock for two mappings

Some operations such as reflinking blocks among files will need to lock
invalidate_lock for two mappings. Add helper functions to do that.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

# 730633f0 28-Jan-2021 Jan Kara <jack@suse.cz>

mm: Protect operations adding pages to page cache with invalidate_lock

Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is

fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / ..
<create pages in page
cache here>
<update fs block mapping and free blocks>

Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.

So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

# 603e4922 31-May-2021 Christoph Hellwig <hch@lst.de>

remove the raw driver

The raw driver used to provide direct unbuffered access to block devices
before O_DIRECT was invented. It has been obsolete for more than a

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/Pine.LNX.4.64.0703180754060.6605@CPE00045a9c397f-CM001225dbafb6/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210531072526.97052-1-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# ffb37ca3 01-Apr-2021 Al Viro <viro@zeniv.linux.org.uk>

switch file_open_root() to struct path

... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# acc6100d 10-Jun-2021 Zhang Yi <yi.zhang@huawei.com>

fs: remove bdev_try_to_free_page callback

After remove the unique user of sop->bdev_try_to_free_page() callback,
we could remove the callback and the corresponding blkdev_releasepage()
at all.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210610112440.3438139-9-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# b82a96c9 28-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: remove noop_set_page_dirty()

Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future. It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.

[akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]

Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c1e3dbe9 28-Jun-2021 Christoph Hellwig <hch@lst.de>

fs: move ramfs_aops to libfs

Move the ramfs aops to libfs and reuse them for kernfs and configfs.
Thosw two did not wire up ->set_page_dirty before and now get
__set_page_dirty_no_writeback, which is the right one for no-writeback
address_space usage.

Drop the now unused exports of the libfs helpers only used for ramfs-style
pagecache usage.

Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# bbcd53c9 06-May-2021 David Hildenbrand <david@redhat.com>

drivers/char: remove /dev/kmem for good

Patch series "drivers/char: remove /dev/kmem for good".

Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
memory ballooning, I started questioning the existence of /dev/kmem.

Comparing it with the /proc/kcore implementation, it does not seem to be
able to deal with things like

a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
-> kern_addr_valid(). virt_addr_valid() is not sufficient.

b) Special cases like gart aperture memory that is not to be touched
-> mem_pfn_is_ram()

Unless I am missing something, it's at least broken in some cases and might
fault/crash the machine.

Looks like its existence has been questioned before in 2005 and 2010 [1],
after ~11 additional years, it might make sense to revive the discussion.

CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
mistake?). All distributions disable it: in Ubuntu it has been disabled
for more than 10 years, in Debian since 2.6.31, in Fedora at least
starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
15sp2, and OpenSUSE has it disabled as well.

1) /dev/kmem was popular for rootkits [2] before it got disabled
basically everywhere. Ubuntu documents [3] "There is no modern user of
/dev/kmem any more beyond attackers using it to load kernel rootkits.".
RHEL documents in a BZ [5] "it served no practical purpose other than to
serve as a potential security problem or to enable binary module drivers
to access structures/functions they shouldn't be touching"

2) /proc/kcore is a decent interface to have a controlled way to read
kernel memory for debugging puposes. (will need some extensions to
deal with memory offlining/unplug, memory ballooning, and poisoned
pages, though)

3) It might be useful for corner case debugging [1]. KDB/KGDB might be a
better fit, especially, to write random memory; harder to shoot
yourself into the foot.

4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
to be incompatible with 64bit [1]. For educational purposes,
/proc/kcore might be used to monitor value updates -- or older
kernels can be used.

5) It's broken on arm64, and therefore, completely disabled there.

Looks like it's essentially unused and has been replaced by better
suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
just remove it.

[1] https://lwn.net/Articles/147901/
[2] https://www.linuxjournal.com/article/10505
[3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
[4] https://sourceforge.net/projects/kme/
[5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Gregory Clement <gregory.clement@bootlin.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: James Troup <james.troup@canonical.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <kasong@redhat.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: openrisc@lists.librecores.org
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Pavel Machek (CIP)" <pavel@denx.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: sparclinux@vger.kernel.org
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Theodore Dubois <tblodt@icloud.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: William Cohen <wcohen@redhat.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 8bc3c481 04-May-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove nrexceptional from inode

We no longer track anything in nrexceptional, so remove it, saving 8 bytes
per inode.

Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 63135aa3 29-Apr-2021 Jens Axboe <axboe@kernel.dk>

mm: provide filemap_range_needs_writeback() helper

Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.

An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town.

For an async buffered read like workload, I am normally expecting _zero_
offloads to a worker thread, but this one had tons of them. I'd drop
caches and things would look good again, but then a minute later we'd
regress back to using workers. Turns out that every minute something
was reading parts of the device, which would add page cache for that
inode. I put patches like these in for our kernel, and the problem was

Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range. This causes unnecessary work from the
callers side, when the IO could have been issued totally fine without
blocking on writeback when there is none.

This patch (of 3):

For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range. The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT as we'll fallback to the
slow path (and needing retry) if there's just a clean page cache page in
the range.

Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.

Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c790fbf2 07-Apr-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Document file_ra_state

Turn the comments into kernel-doc and improve the wording slightly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/20210407201857.3582797-3-willy@infradead.org/
Link: https://lore.kernel.org/r/161789068619.6155.1397999970593531574.stgit@warthog.procyon.org.uk/ # v6

# 51db776a 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove unused ioctl helpers

Remove vfs_ioc_setflags_prepare(), vfs_ioc_fssetxattr_check() and
simple_fill_fsxattr(), which are no longer used.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# 4c5b4799 07-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: add fileattr ops

There's a substantial amount of boilerplate in filesystems handling

Also due to userspace buffers being involved in the ioctl API this is
difficult to stack, as shown by overlayfs issues related to these ioctls.

Introduce a new internal API named "fileattr" (fsxattr can be confused with
xattr, xflags is inappropriate, since this is more than just flags).

There's significant overlap between flags and xflags and this API handles
the conversions automatically, so filesystems may choose which one to use.

In ->fileattr_get() a hint is provided to the filesystem whether flags or
xattr are being requested by userspace, but in this series this hint is
ignored by all filesystems, since generating all the attributes is cheap.

If a filesystem doesn't implemement the fileattr API, just fall back to
f_op->ioctl(). When all filesystems are converted, the fallback can be

32bit compat ioctls are now handled by the generic code as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 92cb01c7 01-Mar-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: update kernel-doc for vfs_rename()

Commit 9fe61450972d ("namei: introduce struct renamedata") introduces a
new struct for vfs_rename() and makes the vfs_rename() kernel-doc argument
description out of sync.

Move the description of arguments for vfs_rename() to a new kernel-doc for
the struct renamedata to make these descriptions checkable against the
actual implementation.

Link: https://lore.kernel.org/r/20210204180059.28360-3-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 39015399 04-Feb-2021 Lukas Bulwahn <lukas.bulwahn@gmail.com>

fs: turn some comments into kernel-doc

While reviewing ./include/linux/fs.h, I noticed that three comments can
actually be turned into kernel-doc comments. This allows to check the
consistency between the descriptions and the functions' signatures in
case they may change in the future.

A quick validation with the consistency check:

./scripts/kernel-doc -none include/linux/fs.h

currently reports no issues in this file.

Link: https://lore.kernel.org/r/20210204180059.28360-2-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# db998553 20-Mar-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: introduce two inode i_{u,g}id initialization helpers

Give filesystem two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many

Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 8e538913 20-Mar-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: introduce fsuidgid_has_mapping() helper

Don't open-code the checks and instead move them into a clean little
helper we can call. This also reduces the risk that if we ever change
something we forget to change all locations.

Link: https://lore.kernel.org/r/20210320122623.599086-4-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# a65e58e7 20-Mar-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: document and rename fsid helpers

Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.

Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 1bd66c1a 20-Mar-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: document mapping helpers

Document new helpers we introduced this cycle.

Link: https://lore.kernel.org/r/20210320122623.599086-2-christian.brauner@ubuntu.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 6e3e2c43 01-Mar-2021 Al Viro <viro@zeniv.linux.org.uk>

new helper: inode_wrong_type()

inode_wrong_type(inode, mode) returns true if setting inode->i_mode
to given value would've changed the inode type. We have enough of
those checks open-coded to make a helper worthwhile.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 87fa0f3e 24-Feb-2021 Christoph Hellwig <hch@lst.de>

mm/filemap: rename generic_file_buffered_read to filemap_read

Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.

Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 643fe55a 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

open: handle idmapped mounts in do_truncate()

When truncating files the vfs will verify that the caller is privileged
over the inode. Extend it to handle idmapped mounts. If the inode is
accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the permissions checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 6521f891 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: prepare for idmapped mounts

The various vfs_*() helpers are called by filesystems or by the vfs
itself to perform core operations such as create, link, mkdir, mknod, rename,
rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
inode is accessed through an idmapped mount map it into the
mount's user namespace and pass it down. Afterwards the checks and
operations are identical to non-idmapped mounts. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 9fe61450 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: introduce struct renamedata

In order to handle idmapped mounts we will extend the vfs rename helper
to take two new arguments in follow up patches. Since this operations
already takes a bunch of arguments add a simple struct renamedata and
make the current helper use it before we extend it.

Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# ba73d987 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: handle idmapped mounts in may_*() helpers

The may_follow_link(), may_linkat(), may_lookup(), may_open(),
may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
helpers determine whether the caller is privileged enough to perform the
associated operations. Let them handle idmapped mounts by mapping the
inode or fsids according to the mount's user namespace. Afterwards the
checks are identical to non-idmapped inodes. The patch takes care to
retrieve the mount's user namespace right before performing permission
checks and passing it down into the fileystem so the user namespace
can't change in between by someone idmapping a mount that is currently
not idmapped. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 0d56a451 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

stat: handle idmapped mounts

The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 2f221d6f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 21cb47be 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

inode: make init and permission helpers idmapped mount aware

The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 47291baa 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: make permission helpers idmapped mount aware

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# 02f92b38 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: add file and path permissions helpers

Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# e6c9a714 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: add id translation helpers

Add simple helpers to make it easy to map kuids into and from idmapped
mounts. We provide simple wrappers that filesystems can use to e.g.
initialize inodes similar to i_{uid,gid}_read() and i_{uid,gid}_write().
Accessing an inode through an idmapped mount maps the i_uid and i_gid of
the inode to the mount's user namespace. If the fsids are used to
initialize inodes they are unmapped according to the mount's user
namespace. Passing the initial user namespace to these helpers makes
them a nop and so any non-idmapped paths will not be impacted.

Link: https://lore.kernel.org/r/20210121131959.646623-3-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# a6435940 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

mount: attach mappings to mounts

In order to support per-mount idmappings vfsmounts are marked with user
namespaces. The idmapping of the user namespace will be used to map the
ids of vfs objects when they are accessed through that mount. By default
all vfsmounts are marked with the initial user namespace. The initial
user namespace is used to indicate that a mount is not idmapped. All
operations behave as before.

Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users to
setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
Later patches enforce that once a mount has been idmapped it can't be
remapped. This keeps permission checking and life-cycle management
simple. Users wanting to change the idmapped can always create a new
detached mount with a different idmapping.

Add a new mnt_userns member to vfsmount and two simple helpers to
retrieve the mnt_userns from vfsmounts and files.

The idea to attach user namespaces to vfsmounts has been floated around
in various forms at Linux Plumbers in ~2018 with the original idea
tracing back to a discussion in 2017 at a conference in St. Petersburg
between Christoph, Tycho, and myself.

Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

# ed296c6c 12-Jan-2021 Eric Biggers <ebiggers@google.com>

ext4: simplify i_state checks in __ext4_update_other_inode_time()

Since I_DIRTY_TIME and I_DIRTY_INODE are mutually exclusive in i_state,
there's no need to check for I_DIRTY_TIME && !I_DIRTY_INODE. Just check

Also introduce a helper function in include/linux/fs.h to do this check.

Link: https://lore.kernel.org/r/20210112190253.64307-12-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 1e9d6333 12-Jan-2021 Eric Biggers <ebiggers@google.com>

fs: correctly document the inode dirty flags

The documentation for I_DIRTY_SYNC and I_DIRTY_DATASYNC is a bit
misleading, and I_DIRTY_TIME isn't documented at all. Fix this.

Link: https://lore.kernel.org/r/20210112190253.64307-3-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 794c43f7 28-Dec-2020 Eric Biggers <ebiggers@google.com>

libfs: unexport generic_ci_d_compare() and generic_ci_d_hash()

Now that generic_set_encrypted_ci_d_ops() has been added and ext4 and
f2fs are using it, it's no longer necessary to export
generic_ci_d_compare() and generic_ci_d_hash() to filesystems.

Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

# 52650c8b 14-Dec-2020 Jason Gunthorpe <jgg@ziepe.ca>

mm/gup: remove the vma allocation from gup_longterm_locked()

Long ago there wasn't a FOLL_LONGTERM flag so this DAX check was done by
post-processing the VMA list.

These days it is trivial to just check each VMA to see if it is DAX before
processing it inside __get_user_pages() and return failure if a DAX VMA is
encountered with FOLL_LONGTERM.

Removing the allocation of the VMA list is a significant speed up for many
call sites.

Add an IS_ENABLED to vma_is_fsdax so that code generation is unchanged
when DAX is compiled out.

Remove the dummy version of __gup_longterm_locked() as !CONFIG_CMA already
makes memalloc_nocma_save(), check_and_migrate_cma_pages(), and
memalloc_nocma_restore() into a NOP.

Link: https://lkml.kernel.org/r/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 88149082 07-Dec-2020 Hao Li <lihao2018.fnst@cn.fujitsu.com>

fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()

If generic_drop_inode() returns true, it means iput_final() can evict
this inode regardless of whether it is dirty or not. If we check
I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be
evicted unconditionally. This is not the desired behavior because
I_DONTCACHE only means the inode shouldn't be cached on the LRU list.
As for whether we need to evict this inode, this is what
generic_drop_inode() should do. This patch corrects the usage of

This patch was proposed in [1].

[1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/

Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 608af703 18-Nov-2020 Daniel Rosenberg <drosen@google.com>

libfs: Add generic function for setting dentry_ops

This adds a function to set dentry operations at lookup time that will
work for both encrypted filenames and casefolded filenames.

A filesystem that supports both features simultaneously can use this
function during lookup preparations to set up its dentry operations once
fscrypt no longer does that itself.

Currently the casefolding dentry operation are always set if the
filesystem defines an encoding because the features is toggleable on
empty directories. Unlike in the encryption case, the dentry operations
used come from the parent. Since we don't know what set of functions
we'll eventually need, and cannot change them later, we enable the
casefolding operations if the filesystem supports them at all.

By splitting out the various cases, we support as few dentry operations
as we can get away with, maximizing compatibility with overlayfs, which
will not function if a filesystem supports certain dentry_operations.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

# 4e7b5671 23-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove i_bdev

Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case). This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 60b49885 16-Nov-2020 Christoph Hellwig <hch@lst.de>

fs: remove get_super_thawed and get_super_exclusive_thawed

Just open code the wait in the only caller of both functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 9b852342 10-Nov-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: move __sb_{start,end}_write* to fs.h

Now that we've straightened out the callers, move these three functions
to fs.h since they're fairly trivial.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>

# 8a3c84b6 10-Nov-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: separate __sb_start_write into blocking and non-blocking helpers

Break this function into two helpers so that it's obvious that the
trylock versions return a value that must be checked, and the blocking
versions don't require that. While we're at it, clean up the return
type mismatch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 5e01fdff 31-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

fs: Replace zero-length array with flexible-array member

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>

# 319c1517 01-Oct-2020 Al Viro <viro@zeniv.linux.org.uk>

epoll: take epitem list out of struct file

Move the head of epitem list out of struct file; for epoll ones it's
moved into struct eventpoll (->refs there), for non-epoll - into
the new object (struct epitem_head). In place of ->f_ep_links we
leave a pointer to the list head (->f_ep).

->f_ep is protected by ->f_lock and it's zeroed as soon as the list
of epitems becomes empty (that can happen only in ep_remove() by

The list of files for reverse path check is *not* going through
struct file now - it's a single-linked list going through epitem_head
instances. It's terminated by ERR_PTR(-1) (== EP_UNACTIVE_POINTER),
so the elements of list can be distinguished by head->next != NULL.

epitem_head instances are allocated at ep_insert() time (by
attach_epitem()) and freed either by ep_remove() (if it empties
the set of epitems *and* epitem_head does not belong to the
reverse path check list) or by clear_tfile_check_list() when
the list is emptied (if the set of epitems is empty by that
point). Allocations are done from a separate slab - minimal kmalloc()
size is too large on some architectures.

As the result, we trim struct file _and_ get rid of the games with
temporary file references.

Locking and barriers are interesting (aren't they always); see unlist_file()
and ep_remove() for details. The non-obvious part is that ep_remove() needs
to decide if it will be the one to free the damn thing *before* actually
storing NULL to head->epitems.first - that's what smp_load_acquire is for
in there. unlist_file() lockless path is safe, since we hit it only if
we observe NULL in head->epitems.first and whoever had done that store is
guaranteed to have observed non-NULL in head->next. IOW, their last access
had been the store of NULL into ->epitems.first and we can safely free
the sucker. OTOH, we are under rcu_read_lock() and both epitem and
epitem->file have their freeing RCU-delayed. So if we see non-NULL
->epitems.first, we can grab ->f_lock (all epitems in there share the
same struct file) and safely recheck the emptiness of ->epitems; again,
->next is still non-NULL, so ep_remove() couldn't have freed head yet.
->f_lock serializes us wrt ep_remove(); the rest is trivial.

Note that once head->epitems becomes NULL, nothing can get inserted into
it - the only remaining reference to head after that point is from the
reverse path check list.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 44cdc1d9 27-Sep-2020 Al Viro <viro@zeniv.linux.org.uk>

convert ->f_ep_links/->fllink to hlist

we don't care about the order of elements there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 6f4d2f97 15-Oct-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: do not update nr_thps for mappings which support THPs

The nr_thps counter is to support THPs in the page cache when the
filesystem doesn't understand THPs. Eventually it will be removed, but we
should still support filesystems which do not understand THPs yet. Move
the nr_thp manipulation functions to filemap.h since they're page-cache

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20200916032717.22917-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 01c70267 15-Oct-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: add a filesystem flag for THPs

The page cache needs to know whether the filesystem supports THPs so that
it doesn't send THPs to filesystems which can't handle them. Dave Chinner
points out that getting from the page mapping to the filesystem type is
too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
information in the address space flags.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 407e9c63 15-Oct-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: move the generic write and copy checks out of mm

The generic write check helpers also don't have much to do with the page
cache, so move them to the vfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 1b2c54d6 14-Oct-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: move the remap range helpers to remap_range.c

Complete the migration by moving the file remapping helper functions out
of read_write.c and into remap_range.c. This reduces the clutter in the
first file and (eventually) will make it so that we can compile out the
second file if it isn't needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 02e83f46 14-Oct-2020 Darrick J. Wong <darrick.wong@oracle.com>

vfs: move generic_remap_checks out of mm

I would like to move all the generic helpers for the vfs remap range
functionality (aka clonerange and dedupe) into a separate file so that
they won't be scattered across the vfs and the mm subsystems. The
eventual goal is to be able to deselect remap_range.c if none of the
filesystems need that code, but the tricky part here is picking a
stable(ish) part of the merge window to rearrange code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# eb1d7a65 13-Oct-2020 Yafang Shao <laoar.shao@gmail.com>

mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED

Our users reported that there're some random latency spikes when their RT
process is running. Finally we found that latency spike is caused by
FADV_DONTNEED. Which may call lru_add_drain_all() to drain LRU cache on
remote CPUs, and then waits the per-cpu work to complete. The wait time
is uncertain, which may be tens millisecond.

That behavior is unreasonable, because this process is bound to a specific
CPU and the file is only accessed by itself, IOW, there should be no
pagecache pages on a per-cpu pagevec of a remote CPU. That unreasonable
behavior is partially caused by the wrong comparation of the number of
invalidated pages and the number of the target. For example,

if (count < (end_index - start_index + 1))

The count above is how many pages were invalidated in the local CPU, and
(end_index - start_index + 1) is how many pages should be invalidated.
The usage of (end_index - start_index + 1) is incorrect, because they are
virtual addresses, which may not mapped to pages. Besides that, there may
be holes between start and end. So we'd better check whether there are
still pages on per-cpu pagevec after drain the local cpu, and then decide
whether or not to call lru_add_drain_all().

After I applied it with a hotfix to our production environment, most of
the lru_add_drain_all() can be avoided.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c33fe275 24-Sep-2020 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: remove no longer used dio_end_io()

Since we removed the last user of dio_end_io() when btrfs got converted
to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
the helper function dio_end_io().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# b89999d0 02-Oct-2020 Scott Branden <scott.branden@broadcom.com>

fs/kernel_read_file: Split into separate include file

Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h
include file. That header gets pulled in just about everywhere
and doesn't really need functions not related to the general fs interface.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com
Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# 06e67b84 02-Oct-2020 Kees Cook <keescook@chromium.org>

fs/kernel_read_file: Remove FIRMWARE_EFI_EMBEDDED enum

The "FIRMWARE_EFI_EMBEDDED" enum is a "where", not a "what". It
should not be distinguished separately from just "FIRMWARE", as this
confuses the LSMs about what is being loaded. Additionally, there was
no actual validation of the firmware contents happening.

Fixes: e4c2c0ff00ec ("firmware: Add new platform fallback mechanism and firmware_request_platform()")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201002173828.2099543-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# c307459b 02-Oct-2020 Kees Cook <keescook@chromium.org>

fs/kernel_read_file: Remove FIRMWARE_PREALLOC_BUFFER enum

FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
that are interested in filtering between types of things. The "how"
should be an internal detail made uninteresting to the LSMs.

Fixes: a098ecd2fa7d ("firmware: support loading into a pre-allocated buffer")
Fixes: fd90bc559bfb ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
Fixes: 4f0496d8ffa3 ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# bfdc5970 24-Sep-2020 Christoph Hellwig <hch@lst.de>

iov_iter: refactor rw_copy_check_uvector and import_iovec

Split rw_copy_check_uvector into two new helpers with more sensible
calling conventions:

- iovec_from_user copies a iovec from userspace either into the provided
stack buffer if it fits, or allocates a new buffer for it. Returns
the actually used iovec. It also verifies that iov_len does fit a
signed type, and handles compat iovecs if the compat flag is set.
- __import_iovec consolidates the native and compat versions of
import_iovec. It calls iovec_from_user, then validates each iovec
actually points to user addresses, and ensures the total length
doesn't overflow.

This has two major implications:

- the access_process_vm case loses the total lenght checking, which
wasn't required anyway, given that each call receives two iovecs
for the local and remote side of the operation, and it verifies
the total length on the local side already.
- instead of a single loop there now are two loops over the iovecs.
Given that the iovecs are cache hot this doesn't make a major

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ce71bfea 31-Aug-2020 Jens Axboe <axboe@kernel.dk>

fs: align IOCB_* flags with RWF_* flags

We have a set of flags that are shared between the two and inherired
in kiocb_set_rw_flags(), but we check and set these individually.
Reorder the IOCB flags so that the bottom part of the space is synced
with the RWF flag space, and then we can do them all in one mask and
set operation.

The only exception is RWF_SYNC, which needs to mark IOCB_SYNC and
IOCB_DSYNC. Do that one separately.

This shaves 15 bytes of text from kiocb_set_rw_flags() for me.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# a3ec6005 18-Sep-2020 Jens Axboe <axboe@kernel.dk>

io_uring: move io_uring_get_socket() into io_uring.h

Now we have a io_uring kernel header, move this definition out of fs.h
and into io_uring.h where it belongs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 09f1bde4 26-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: move vfs_fstatat out of line

This allows to keep vfs_statx static in fs/stat.c to prepare for the following

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 0b2c6693 26-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat

Go through vfs_fstatat instead of duplicating the *stat to statx mapping
three times.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# da9aa5d9 26-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: remove vfs_statx_fd

vfs_statx_fd is only used to implement vfs_fstat. Remove vfs_statx_fd
and just implement vfs_fstat directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1cb039f3 24-Sep-2020 Christoph Hellwig <hch@lst.de>

bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag

The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 402dd2cf 24-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: remove the unused SB_I_MULTIROOT flag

The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c28
("NFS: Add fs_context support.")

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# c843843e 08-Jul-2020 Daniel Rosenberg <drosen@google.com>

fs: Add standard casefolding support

This adds general supporting functions for filesystems that use
utf8 casefolding. It provides standard dentry_operations and adds the
necessary structures in struct super_block to allow this standardization.

The new dentry operations are functionally equivalent to the existing
operations in ext4 and f2fs, apart from the use of utf8_casefold_hash to
avoid an allocation.

By providing a common implementation, all users can benefit from any
optimizations without needing to port over improvements.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

# 36e2c742 03-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: don't allow splice read/write without explicit ops

default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers. It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method. The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files. If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 34ae204f 11-Aug-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.

Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.

Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate

To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# efa8480a 10-Aug-2020 Jens Axboe <axboe@kernel.dk>

fs: RWF_NOWAIT should imply IOCB_NOIO

With the change allowing read-ahead for IOCB_NOWAIT, we changed the
RWF_NOWAIT semantics of only doing cached reads. Since we know have
IOCB_NOIO to manage that specific side of it, just make RWF_NOWAIT
imply IOCB_NOIO as well to restore the previous behavior.

Fixes: 2e85abf053b9 ("mm: allow read-ahead with IOCB_NOWAIT set")
Reported-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 45e55300 07-Aug-2020 Peter Collingbourne <pcc@google.com>

mm: remove unnecessary wrapper function do_mmap_pgoff()

The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.

The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().

Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# e809d5f0 07-Aug-2020 Chris Down <chris@chrisdown.name>

tmpfs: per-superblock i_ino support

Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even having
as many as 150 duplicated inode numbers with differing file content.

This causes actual, tangible problems in production. For example, we have
complaints from those working on remote caches that their application is
reporting cache corruptions because it uses (device, inodenum) to
establish the identity of a particular cache object, but because it's not
unique any more, the application refuses to continue and reports cache
corruption. Even worse, sometimes applications may not even detect the
corruption but may continue anyway, causing phantom and hard to debug

In general, userspace applications expect that (device, inodenum) should
be enough to be uniquely point to one inode, which seems fair enough. One
might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
(ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
same inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This
itself allows some reduction in i_ino churn. This works on both 64-
and 32- bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
64-bit ino_t only: we allow users to mount tmpfs with a new inode64
option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

You can see how this compares to previous related patches which didn't
implement this per-superblock:

- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/

This patch (of 2):

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to become overflowed
if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
means it's not that difficult to cause inode number wraparounds on a
single device, which can result in having multiple distinct inodes
with the same inode number.

This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount. This is implemented in the next commit.

For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.

Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c9dff084 09-Jul-2020 Andreas Gruenbacher <agruenba@redhat.com>

fs: Fix typo in comment

The comment for function filemap_check_wb_err accidentally refers to
it as filemap_check_wb_error.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

# 1752f0ad 01-Aug-2020 Pavel Begunkov <asml.silence@gmail.com>

fs: optimise kiocb_set_rw_flags()

Use a local var to collect flags in kiocb_set_rw_flags(). That spares
some memory writes and allows to replace most of the jumps with MOVEcc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# fd5ad30c 15-Jul-2020 Christoph Hellwig <hch@lst.de>

fs: expose utimes_common

Rename utimes_common to vfs_utimes and make it available outside of
utimes.c. This will be used by the initramfs unpacking code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

# 6414e9b0 12-Jul-2020 Eric Biggers <ebiggers@google.com>

fs: define inode flags using bit numbers

Define the VFS inode flags using bit numbers instead of hardcoding
powers of 2, which has become unwieldy now that we're up to 65536.

No change in the actual values.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9e96c8c0 14-Jul-2020 Christoph Hellwig <hch@lst.de>

fs: add a vfs_fchmod helper

Add a helper for struct file based chmode operations. To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

# c04011fe 14-Jul-2020 Christoph Hellwig <hch@lst.de>

fs: add a vfs_fchown helper

Add a helper for struct file based chown operations. To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

# 457e7a13 01-Jul-2020 Satya Tangirala <satyat@google.com>

fs: introduce SB_INLINECRYPT

Introduce SB_INLINECRYPT, which is set by filesystems that wish to use
blk-crypto for file content en/decryption. This flag maps to the
'-o inlinecrypt' mount option which multiple filesystems will implement,
and code in fs/crypto/ needs to be able to check for this mount option
in a filesystem-independent way.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-2-satyat@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>

# 775802c0 08-May-2020 Christoph Hellwig <hch@lst.de>

fs: remove __vfs_read

Fold it into the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>

# 61a707c5 08-May-2020 Christoph Hellwig <hch@lst.de>

fs: add a __kernel_read helper

This is the counterpart to __kernel_write, and skip the rw_verify_area
call compared to kernel_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>

# 41da51bc 21-Nov-2019 Andreas Gruenbacher <agruenba@redhat.com>

fs: Add IOCB_NOIO flag for generic_file_read_iter

Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
shouldn't trigger any filesystem I/O for the actual request or for
readahead. This allows to do tentative reads out of the page cache as
some filesystems allow, and to take the appropriate locks and retry the
reads only if the requested pages are not cached.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

# 21b9cb34 23-Jun-2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

fs: fs.h: fix a kernel-doc parameter description

Changeset 3b0311e7ca71 ("vfs: track per-sb writeback errors and report them to syncfs")
added a variant of filemap_sample_wb_err(), but it forgot to
rename the arguments at the kernel-doc markup. Fix it.

Fix those warnings:
./include/linux/fs.h:2845: warning: Function parameter or member 'file' not described in 'file_sample_sb_err'
./include/linux/fs.h:2845: warning: Excess function parameter 'mapping' description in 'file_sample_sb_err'

Fixes: 3b0311e7ca71 ("vfs: track per-sb writeback errors and report them to syncfs")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/7b33bbceb29ac80874622a2bc84127bb10103245.1592895969.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

# 621c1f42 20-Jun-2020 Christoph Hellwig <hch@lst.de>

block: move struct block_device to blk_types.h

Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 3f1266f1 20-Jun-2020 Christoph Hellwig <hch@lst.de>

block: move block-related definitions out of fs.h

Move most of the block related definition out of fs.h into more suitable

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# dd0dca22 20-Jun-2020 Christoph Hellwig <hch@lst.de>

block: simplify sb_is_blkdev_sb

Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 75362a17 20-Jun-2020 Christoph Hellwig <hch@lst.de>

fs: remove the mount_bdev and kill_block_super stubs

No one calls these functions without CONFIG_BLOCK, so don't bother
stubbing them out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 4e24566a 20-Jun-2020 Christoph Hellwig <hch@lst.de>

fs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL defines

These are not defined anywhere, and contrary to the comments we really
do not care about out of tree code at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 7dbac5ba 20-Jun-2020 Christoph Hellwig <hch@lst.de>

fs: remove an unused block_device_operations forward declaration

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 764b23bd 20-Jun-2020 Christoph Hellwig <hch@lst.de>

block: mark bd_finish_claiming static

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# b818f09e 20-Jun-2020 Christoph Hellwig <hch@lst.de>

tty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCK

We can also thaw non-block file systems. Remove the CONFIG_BLOCK in
sysrq.c after making the prototype available unconditionally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# c2a25ec0 22-May-2020 Jens Axboe <axboe@kernel.dk>


If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

# dd3e6d50 22-May-2020 Jens Axboe <axboe@kernel.dk>

mm: add support for async page locking

Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 3373a346 17-Jun-2020 Zheng Bin <zhengbin13@huawei.com>

block: make function 'kill_bdev' static

kill_bdev does not have any external user, so make it static.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 5fcd5750 29-May-2020 Jan Kara <jack@suse.cz>

writeback: Drop I_DIRTY_TIME_EXPIRE

The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

# 5afced3b 29-May-2020 Jan Kara <jack@suse.cz>

writeback: Avoid skipping inode writeback

Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.

However there are two flaws in the current logic:

1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).

2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:

writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES)
inode->i_state |= I_SYNC
inode->i_state |= I_DIRTY_PAGES;
if (inode->i_state & I_SYNC)
if (!(inode->i_state & I_DIRTY_ALL))
- not true so nothing done

We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this

Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.

Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>

# f1084bc6 09-Jun-2020 David Sterba <dsterba@suse.com>

Revert "fs: remove dio_end_io()"

This reverts commit b75b7ca7c27dfd61dba368f390b7d4dc20b3a8cb.

The patch restores a helper that was not necessary after direct IO port
to iomap infrastructure, which gets reverted.

Signed-off-by: David Sterba <dsterba@suse.com>

# c1e8d7c6 08-Jun-2020 Michel Lespinasse <walken@google.com>

mmap locking API: convert mmap_sem comments

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 10c5db28 23-May-2020 Christoph Hellwig <hch@lst.de>

fs: move the fiemap definitions out of fs.h

No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 44ebcd06 23-May-2020 Christoph Hellwig <hch@lst.de>

fs: mark __generic_block_fiemap static

There is no caller left outside of ioctl.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 8151b4c8 01-Jun-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: add readahead address space operation

This replaces ->readpages with a saner interface:
- Return void instead of an ignored error code.
- Page cache is already populated with locked pages when ->readahead
is called.
- New arguments can be passed to the implementation without changing
all the filesystems that use a common helper function like

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-12-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 735e4ae5 01-Jun-2020 Jeff Layton <jlayton@kernel.org>

vfs: track per-sb writeback errors and report them to syncfs

Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 3f19b2ab 01-Dec-2017 David Howells <dhowells@redhat.com>

vfs, afs, ext4: Make the inode hash table RCU searchable

Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.

The main thing this requires is some RCU annotation on the list
manipulation operations. Inodes are already freed by RCU in most cases.

Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.

There are at least three instances where this can be of use:

(1) Testing whether the inode number iunique() is going to return is
currently unique (the iunique_lock is still held).

(2) Ext4 date stamp updating.

(3) AFS callback breaking.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org

# b75b7ca7 25-Nov-2019 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: remove dio_end_io()

Since we removed the last user of dio_end_io(), remove the helper
function dio_end_io().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# d85dc2e1 29-Aug-2019 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: export generic_file_buffered_read()

Export generic_file_buffered_read() to be used to supplement incomplete
direct reads.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

# 3783daeb 19-May-2020 Christoph Hellwig <hch@lst.de>

block: remove ioctl_by_bdev

No callers left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# a3c751a5 14-May-2020 Miklos Szeredi <mszeredi@redhat.com>

vfs: allow unprivileged whiteout creation

Whiteouts, unlike real device node should not require privileges to create.

The general concern with device nodes is that opening them can have side
effects. The kernel already avoids zero major (see
Documentation/admin-guide/devices.txt). To be on the safe side the patch
explicitly forbids registering a char device with 0/0 number (see

This guarantees that a non-O_PATH open on a whiteout will fail with ENODEV;
i.e. it won't have any side effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 2c567af4 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>


DCACHE_DONTCACHE indicates a dentry should not be cached on final

Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.

This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# dae2f8ed 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>

fs: Lift XFS_IDONTCACHE to the VFS layer

DAX effective mode (S_DAX) changes requires inode eviction.

XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the
inode if no other additional references are taken. We lift this flag to
the VFS layer and change the behavior slightly by allowing the flag to
remain even if multiple references are taken.

This will expedite the eviction of inodes to change S_DAX.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 32b1924b2 09-Apr-2020 Konstantin Khlebnikov <koct9i@gmail.com>

ovl: skip overlayfs superblocks at global sync

Stacked filesystems like overlayfs has no own writeback, but they have to
forward syncfs() requests to backend for keeping data integrity.

During global sync() each overlayfs instance calls method ->sync_fs() for
backend although it itself is in global list of superblocks too. As a
result one syscall sync() could write one superblock several times and send
multiple disk barriers.

This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.

Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 28df3d15 28-Jul-2017 J. Bruce Fields <bfields@redhat.com>

nfsd: clients don't need to break their own delegations

We currently revoke read delegations on any write open or any operation
that modifies file data or metadata (including rename, link, and
unlink). But if the delegation in question is the only read delegation
and is held by the client performing the operation, that's not really

It's not always possible to prevent this in the NFSv4.0 case, because
there's not always a way to determine which client an NFSv4.0 delegation
came from. (In theory we could try to guess this from the transport
layer, e.g., by assuming all traffic on a given TCP connection comes
from the same client. But that's not really correct.)

In the NFSv4.1 case the session layer always tells us the client.

This patch should remove such self-conflicts in all cases where we can
reliably determine the client from the compound.

To do that we need to track "who" is performing a given (possibly
lease-breaking) file operation. We're doing that by storing the
information in the svc_rqst and using kthread_data() to map the current
task back to a svc_rqst.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>

# 9d82973e 04-May-2020 Linus Torvalds <torvalds@linux-foundation.org>

gcc-10 warnings: fix low-hanging fruit

Due to a bug-report that was compiler-dependent, I updated one of my
machines to gcc-10. That shows a lot of new warnings. Happily they
seem to be mostly the valid kind, but it's going to cause a round of
churn for getting rid of them..

This is the really low-hanging fruit of removing a couple of zero-sized
arrays in some core code. We have had a round of these patches before,
and we'll have many more coming, and there is nothing special about
these except that they were particularly trivial, and triggered more
warnings than most.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# efbe3c24 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>

fs: Remove unneeded IS_DAX() check in io_is_direct()

Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own. Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Also remove io_is_direct() as it is just a simple flag check.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9bc5c397 14-Apr-2020 Christoph Hellwig <hch@lst.de>

block: fold bdev_unhash_inode into invalidate_partition

invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode. Piggy back on that to remove the inode from the hash.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 02d33b67 14-Apr-2020 Christoph Hellwig <hch@lst.de>

block: mark invalidate_partition static

invalidate_partition is only used in genhd.c, so mark it static. Also
drop the return value given that is is always ignored.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# c0d0381a 01-Apr-2020 Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.

hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# f05a3849 24-Mar-2020 Thomas Hellstrom (VMware) <thomas_os@shipmail.org>

fs: Constify vma argument to vma_is_dax

The function is used by upcoming vma_is_special_huge() with which we want
to use a const vma argument. Since for vma_is_dax() the vma argument is
only dereferenced for reading, constify it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Acked-by: Christian König <christian.koenig@amd.com>

# ea3edd4d 24-Mar-2020 Christoph Hellwig <hch@lst.de>

block: remove __bdevname

There is no good reason for __bdevname to exist. Just open code
printing the string in the callers. For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.

Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# e4c2c0ff 15-Jan-2020 Hans de Goede <hdegoede@redhat.com>

firmware: Add new platform fallback mechanism and firmware_request_platform()

In some cases the platform's main firmware (e.g. the UEFI fw) may contain
an embedded copy of device firmware which needs to be (re)loaded into the
peripheral. Normally such firmware would be part of linux-firmware, but in
some cases this is not feasible, for 2 reasons:

1) The firmware is customized for a specific use-case of the chipset / use
with a specific hardware model, so we cannot have a single firmware file
for the chipset. E.g. touchscreen controller firmwares are compiled
specifically for the hardware model they are used with, as they are
calibrated for a specific model digitizer.

2) Despite repeated attempts we have failed to get permission to
redistribute the firmware. This is especially a problem with customized
firmwares, these get created by the chip vendor for a specific ODM and the
copyright may partially belong with the ODM, so the chip vendor cannot
give a blanket permission to distribute these.

This commit adds a new platform fallback mechanism to the firmware loader
which will try to lookup a device fw copy embedded in the platform's main
firmware if direct filesystem lookup fails.

Drivers which need such embedded fw copies can enable this fallback
mechanism by using the new firmware_request_platform() function.

Note that for now this is only supported on EFI platforms and even on
these platforms firmware_fallback_platform() only works if
CONFIG_EFI_EMBEDDED_FIRMWARE is enabled (this gets selected by drivers
which need this), in all other cases firmware_fallback_platform() simply
always returns -ENOENT.

Reported-by: Dave Olsthoorn <dave@bewaar.me>
Suggested-by: Peter Jones <pjones@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20200115163554.101315-5-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# 8019ad13 04-Mar-2020 Peter Zijlstra <peterz@infradead.org>

futex: Fix inode life-time issue

As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

# 901cff7c 23-Jan-2020 Topi Miettinen <toiwoton@gmail.com>

firmware_loader: load files from the mount namespace of init

I have an experimental setup where almost every possible system
service (even early startup ones) runs in separate namespace, using a
dedicated, minimal file system. In process of minimizing the contents
of the file systems with regards to modules and firmware files, I
noticed that in my system, the firmware files are loaded from three
different mount namespaces, those of systemd-udevd, init and
systemd-networkd. The logic of the source namespace is not very clear,
it seems to depend on the driver, but the namespace of the current
process is used.

So, this patch tries to make things a bit clearer and changes the
loading of firmware files only from the mount namespace of init. This
may also improve security, though I think that using firmware files as
attack vector could be too impractical anyway.

Later, it might make sense to make the mount namespace configurable,
for example with a new file in /proc/sys/kernel/firmware_config/. That
would allow a dedicated file system only for firmware files and those
need not be present anywhere else. This configurability would make
more sense if made also for kernel modules and /sbin/modprobe. Modules
are already loaded from init namespace (usermodehelper uses kthreadd
namespace) except when directly loaded by systemd-udevd.

Instead of using the mount namespace of the current process to load
firmware files, use the mount namespace of init process.

Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/
Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/
Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# d7167b14 07-Sep-2019 Al Viro <viro@zeniv.linux.org.uk>

fs_parse: fold fs_parameter_desc/fs_parameter_spec

The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 30460e1e 09-Jan-2020 Carlos Maiolino <cmaiolino@redhat.com>

fs: Enable bmap() function to properly return errors

By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.

It will change the behavior of bmap() on return:

- negative value in case of error
- zero on success or map fell into a hole

In case of a hole, the *block will be zero too

Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ddf8f376 30-Jan-2020 Ira Weiny <ira.weiny@intel.com>

mm/filemap.c: clean up filemap_write_and_wait()

At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()

Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()

Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 5dcdc43e 20-Nov-2019 Jiufei Xue <jiufei.xue@linux.alibaba.com>

vfs: add vfs_iocb_iter_[read|write] helper functions

This doesn't cause any behavior changes and will be used by overlay async
IO implementation.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 77b90401 27-Nov-2019 Arnd Bergmann <arnd@arndb.de>

compat_ioctl: simplify the implementation

Now that both native and compat ioctl syscalls are
in the same file, a couple of simplifications can
be made, bringing the implementation closer together:

- do_vfs_ioctl(), ioctl_preallocate(), and compat_ioctl_preallocate()
can become static, allowing the compiler to optimize better

- slightly update the coding style for consistency between
the functions.

- rather than listing each command in two switch statements
for the compat case, just call a single function that has
all the common commands.

As a side-effect, FS_IOC_RESVSP/FS_IOC_RESVSP64 are now available
to x86 compat tasks, along with FS_IOC_RESVSP_32/FS_IOC_RESVSP64_32.
This is harmless for i386 emulation, and can be considered a bugfix
for x32 emulation, which never supported these in the past.

Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

# a3d1e7eb 18-Nov-2019 Al Viro <viro@zeniv.linux.org.uk>

simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems

two requirements: no file creations in IS_DEADDIR and no cross-directory
renames whatsoever.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ba70609d 02-Dec-2019 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Delete timespec64_trunc()

There are no more callers to the function remaining.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 2496396f 15-Oct-2019 Thomas Gleixner <tglx@linutronix.de>

sched/rt, fs: Use CONFIG_PREEMPTION

Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the i_size() and part_nr_sects_…() code over to use
CONFIG_PREEMPTION. Update the comment for fsstack_copy_inode_size() also

[bigeasy: +PREEMPT comments]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-24-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>

# a92853b6 30-Nov-2019 Konstantin Khlebnikov <koct9i@gmail.com>

fs/direct-io.c: keep dio_warn_stale_pagecache() when CONFIG_BLOCK=n

This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.

See also commit 5a9d929d6e13 ("iomap: report collisions between directio
and buffered writes to userspace").

Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.

Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 2be7d348 26-Nov-2019 Linus Torvalds <torvalds@linux-foundation.org>

Revert "vfs: properly and reliably lock f_pos in fdget_pos()"

This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads. Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service. In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor. Which doesn't work well for

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side). It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

/dev/fuse (fuse_dev_operations)
/dev/mcelog (mce_chrdev_ops)
/dev/mei0 (mei_fops)
/dev/net/tun (tun_fops)
/dev/nvme0 (nvme_dev_fops)
/dev/tpm0 (tpm_fops)
/proc/self/ns/mnt (ns_file_operations)
/dev/snd/pcm* (snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM

Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0be0ee71 11-Nov-2019 Linus Torvalds <torvalds@linux-foundation.org>

vfs: properly and reliably lock f_pos in fdget_pos()

fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file. Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it. With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()". We've done that for all the obvious and common
situations, we may need a few more. Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

"And I appreciate if people could help at least somehow with "getting
rid of mixed case entirely" (i.e. always lock f_pos_lock on
!FMODE_STREAM), because this transition starts to diverge from my
particular use-case too far. To me it makes sense to do that
transition as follows:

- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert left users that
truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this
will cover pipes and sockets), or maybe convert pipes and sockets
to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use
no_llseek or noop_llseek, but do not use nonseekable_open or
alloc_file_pseudo. This might find files that have stream semantic
but are opened differently;
- extend stream_open.cocci to analyze file_operations whose
.read/.write do not use ppos at all (independently of how file was
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
- gather bug reports for deadlocked read/write and convert missed
cases to FMODE_STREAM, probably extending stream_open.cocci along
the road to catch similar cases

i.e. always take f_pos_lock unless a file is explicitly marked as
being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here. There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations. Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# f0b870df 14-Nov-2019 Christoph Hellwig <hch@lst.de>

block: remove (__)blkdev_reread_part as an exported API

In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# a1548b67 14-Nov-2019 Christoph Hellwig <hch@lst.de>

block: move rescan_partitions to fs/block_dev.c

Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 837a6e7f 24-Oct-2019 Christoph Hellwig <hch@lst.de>

fs: add generic UNRESVSP and ZERO_RANGE ioctl handlers

These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS

Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 2952db0f 11-Sep-2018 Arnd Bergmann <arnd@arndb.de>

compat_ioctl: add compat_ptr_ioctl()

Many drivers have ioctl() handlers that are completely compatible between
32-bit and 64-bit architectures, except for the argument that is passed
down from user space and may have to be passed through compat_ptr()
in order to become a valid 64-bit pointer.

Using ".compat_ptr = compat_ptr_ioctl" in file operations should let
us simplify a lot of those drivers to avoid #ifdef checks, and convert
additional drivers that don't have proper compat handling yet.

On most architectures, the compat_ptr_ioctl() just passes all arguments
to the corresponding ->ioctl handler. The exception is arch/s390, where
compat_ptr() clears the top bit of a 32-bit pointer value, so user space
pointers to the second 2GB alias the first 2GB, as is the case for native
32-bit s390 user space.

The compat_ptr_ioctl() function must therefore be used only with
ioctl functions that either ignore the argument or pass a pointer to a
compatible data type.

If any ioctl command handled by fops->unlocked_ioctl passes a plain
integer instead of a pointer, or any of the passed data types is
incompatible between 32-bit and 64-bit architectures, a proper handler
is required instead of compat_ptr_ioctl.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
v3: add a better description
v2: use compat_ptr_ioctl instead of generic_compat_ioctl_ptrarg,
as suggested by Al Viro

# 09d91cda 23-Sep-2019 Song Liu <songliubraving@fb.com>

mm,thp: avoid writes to file with THP in pagecache

In previous patch, an application could put part of its text section in
THP via madvise(). These THPs will be protected from writes when the
application is still running (TXTBSY). However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write. A new counter nr_thps is added to struct
address_space. In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# cf1ea059 29-Aug-2019 Jan Kara <jack@suse.cz>

fs: Export generic_fadvise()

Filesystems will need to call this function from their fadvise handlers.

CC: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 50e17c00 21-Jan-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: Add timestamp_truncate() api

timespec_trunc() function is used to truncate a
filesystem timestamp to the right granularity.
But, the function does not clamp tv_sec part of the
timestamps according to the filesystem timestamp limits.

The replacement api: timestamp_truncate() also alters the
signature of the function to accommodate filesystem
timestamp clamping according to flesystem limits.

Note that the tv_nsec part is set to 0 if tv_sec is not within
the range supported for the filesystem.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>

# 188d20bc 21-Jan-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: Add file timestamp range support

Add fields to the superblock to track the min and max
timestamps supported by filesystems.

Initially, when a superblock is allocated, initialize
it to the max and min values the fields can hold.
Individual filesystems override these to match their
actual limits.

Pseudo filesystems are assumed to always support the
min and max allowable values for the fields.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>

# dc617f29 20-Aug-2019 Darrick J. Wong <darrick.wong@oracle.com>

vfs: don't allow writes to swap files

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 18f6622e 18-Aug-2019 Jeff Layton <jeff.layton@primarydata.com>

locks: create a new notifier chain for lease attempts

With the new file caching infrastructure in nfsd, we can end up holding
files open for an indefinite period of time, even when they are still
idle. This may prevent the kernel from handing out leases on the file,
which is something we don't want to block.

Fix this by running a SRCU notifier call chain whenever on any
lease attempt. nfsd can then purge the cache for that inode before

Since SRCU is only conditionally compiled in, we must only define the
new chain if it's enabled, and users of the chain must ensure that
SRCU is enabled.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

# 22d94f49 04-Aug-2019 Eric Biggers <ebiggers@google.com>

fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl

Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".

Why we need this

The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.

Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions. But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic. This is because it depends on
whether the files are currently present in the inode cache.

Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work. Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons. First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.

Second, almost all users of fscrypt actually do need the keys to be
global. The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process. This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.

On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2]. This is ugly and raises security concerns. Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).

By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.

Why not use the add_key() syscall

We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:

- Supporting key removal with the semantics such that the secret is
removed immediately and any unused inodes using the key are evicted;
also, the eviction of any in-use inodes can be retried.

- Calculating a key-dependent cryptographic identifier and returning it
to userspace.

- Allowing keys to be added and removed by non-root users, but only keys
for v2 encryption policies; and to prevent denial-of-service attacks,
users can only remove keys they themselves have added, and a key is
only really removed after all users who added it have removed it.

Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.

However, to reuse code the implementation still uses the keyrings
service internally. Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.


[1] https://github.com/google/fscrypt
[2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
[3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
[4] https://github.com/google/fscrypt/issues/116
[5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
[6] https://github.com/google/fscrypt/issues/128
[7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>

# 89e524c0 30-Jul-2019 Jan Kara <jack@suse.cz>

loop: Fix mount(2) failure due to race with LOOP_SET_FD

Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:

for i in {a..z}{a..z}; do
dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
mkfs.ext2 $i.image
mkdir mnt$i

echo "Run"
for i in {a..z}{a..z}; do
mount -o loop -t ext2 $i.image mnt$i &

Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.

Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 5585f2af 22-Jul-2019 Eric Biggers <ebiggers@google.com>

fs-verity: add inode and superblock fields

Analogous to fs/crypto/, add fields to the VFS inode and superblock for
use by the fs/verity/ support layer:

- ->s_vop: points to the fsverity_operations if the filesystem supports
fs-verity, otherwise is NULL.

- ->i_verity_info: points to cached fs-verity information for the inode
after someone opens it, otherwise is NULL.

- S_VERITY: bit in ->i_flags that identifies verity inodes, even when
they haven't been opened yet and thus still have NULL ->i_verity_info.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>

# f85d9338 23-May-2019 Benjamin Coddington <bcodding@redhat.com>

locks: Cleanup lm_compare_owner and lm_owner_key

After the update to use nlm_lockowners for the NLM server, there are no
more users of lm_compare_owner and lm_owner_key.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

# 7b0e492e 01-Jul-2019 Darrick J. Wong <darrick.wong@oracle.com>

vfs: create a generic checking function for FS_IOC_FSSETXATTR

Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>

# 5aca2842 01-Jul-2019 Darrick J. Wong <darrick.wong@oracle.com>

vfs: create a generic checking and prep function for FS_IOC_SETFLAGS

Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.

Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>

# aa0bfcd9 20-Jun-2019 Ross Zwisler <zwisler@chromium.org>

mm: add filemap_fdatawait_range_keep_errors()

In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org

# 387e3746 07-Jun-2019 Amir Goldstein <amir73il@gmail.com>

locks: eliminate false positive conflicts for write lease

check_conflicting_open() is checking for existing fd's open for read or
for write before allowing to take a write lease. The check that was
implemented using i_count and d_count is an approximation that has
several false positives. For example, overlayfs since v4.19, takes an
extra reference on the dentry; An open with O_PATH takes a reference on
the dentry although the file cannot be read nor written.

Change the implementation to use i_readcount and i_writecount to
eliminate the false positive conflicts and allow a write lease to be
taken on an overlayfs file.

The change of behavior with existing fd's open with O_PATH is symmetric
w.r.t. current behavior of lease breakers - an open with O_PATH currently
does not break a write lease.

This increases the size of struct inode by 4 bytes on 32bit archs when
CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

# e38f7f53 05-Jun-2019 Amir Goldstein <amir73il@gmail.com>

vfs: introduce file_modified() helper

The combination of file_remove_privs() and file_update_mtime() is
quite common in filesystem ->write_iter() methods.

Modelled after the helper file_accessed(), introduce file_modified()
and use it from generic_remap_file_range_prep().

Note that the order of calling file_remove_privs() before
file_update_mtime() in the helper was matched to the more common order by
filesystems and not the current order in generic_remap_file_range_prep().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 96e6e8f4 05-Jun-2019 Amir Goldstein <amir73il@gmail.com>

vfs: add missing checks to copy_file_range

Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# a3171351 05-Jun-2019 Amir Goldstein <amir73il@gmail.com>

vfs: introduce generic_file_rw_checks()

Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# f16acc9d 05-Jun-2019 Dave Chinner <dchinner@redhat.com>

vfs: introduce generic_copy_file_range()

Right now if vfs_copy_file_range() does not use any offload
mechanism, it falls back to calling do_splice_direct(). This fails
to do basic sanity checks on the files being copied. Before we
start adding this necessarily functionality to the fallback path,
separate it out into generic_copy_file_range().

generic_copy_file_range() has the same prototype as
->copy_file_range() so that filesystems can use it in their custom
->copy_file_range() method if they so choose.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# 5c437fa2 07-Jun-2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

docs: fs: fix broken links to vfs.txt with was renamed to vfs.rst

A recent documentation conversion renamed this file but forgot
to update the links.

Fixes: af96c1e304f7 ("docs: filesystems: vfs: Convert vfs.txt to RST")
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

# 0b3b094a 15-May-2019 Jan Kara <jack@suse.cz>

fanotify: Disallow permission events for proc filesystem

Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.

Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 023d066a 25-Mar-2019 David Howells <dhowells@redhat.com>

vfs: Kill sget_userns()

Kill sget_userns(), folding it into sget() as that's the only remaining

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org

# 8d9e46d8 11-May-2019 Al Viro <viro@zeniv.linux.org.uk>

fold mount_pseudo_xattr() into pseudo_fs_get_tree()

... now that all other callers are gone

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# bb7b6b2b 25-Mar-2019 David Howells <dhowells@redhat.com>

vfs: Kill mount_ns()

Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1f58bb18 20-May-2019 Al Viro <viro@zeniv.linux.org.uk>

mount_pseudo(): drop 'name' argument, switch to d_make_root()

Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 22f96b38 09-Apr-2019 Jens Axboe <axboe@kernel.dk>

fs: add sync_file_range() helper

This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

# fdb0da89 10-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

new inode method: ->free_inode()

A lot of ->destroy_inode() instances end with call_rcu() of a callback
that does RCU-delayed part of freeing. Introduce a new method for
doing just that, with saner signature.

->destroy_inode ->free_inode
f g immediate call of f(),
RCU-delayed call of g()
f NULL immediate call of f(),
no RCU-delayed calls
NULL g RCU-delayed call of g()
NULL NULL RCU-delayed default freeing

IOW, NULL ->free_inode gives the same behaviour as now.

Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
mandate the latter form, but that would have very little benefit beyond
making rules a bit more symmetric. It would break backwards compatibility,
require extra boilerplate and expected semantics for (NULL, NULL) pair
would have no use whatsoever...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# b886ee3e 25-Apr-2019 Gabriel Krisman Bertazi <krisman@collabora.co.uk>

ext4: Support case-insensitive file name lookups

This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.


- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# d4696601 25-Apr-2019 David Howells <dhowells@redhat.com>

afs: Add file locking tracepoints

Add two tracepoints for monitoring AFS file locking. Firstly, add one that
follows the operational part:

echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_op/enable

And add a second that more follows the event-driven part:

echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_ev/enable

Individual file_lock structs seen by afs are tagged with debugging IDs that
are displayed in the trace log to make it easier to see what's going on,
especially as setting the first lock always seems to involve copying the
file_lock twice.

Signed-off-by: David Howells <dhowells@redhat.com>

# 1caf7a70 10-Apr-2019 Lukas Bulwahn <lukas.bulwahn@gmail.com>

fs: drop unused fput_atomic definition

commit d7065da03822 ("get rid of the magic around f_count in aio") added
fput_atomic to include/linux/fs.h, motivated by its use in __aio_put_req()
in fs/aio.c.

Later, commit 3ffa3c0e3f6e ("aio: now fput() is OK from interrupt context;
get rid of manual delayed __fput()") removed the only use of fput_atomic
in __aio_put_req(), but did not remove the since then unused fput_atomic
definition in include/linux/fs.h.

We curate this now and finally remove the unused definition.

This issue was identified during a code review due to a coccinelle warning
from the atomic_as_refcounter.cocci rule pointing to the use of atomic_t
in fput_atomic.

Suggested-by: Krystian Radlak <kradlak@exida.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 10dce8af 26-Mar-2019 Kirill Smelkov <kirr@nexedi.com>

fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock

Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.

This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of

The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.

However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655e3 -
is doing so irregardless of whether a file is seekable or not.



for historic context.

The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some

kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter

Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.

Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):

drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.

FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:




Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:


I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:


Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there is no other methods in file_operations
which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
steam_open if that bit is present in filesystem open reply.

It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers


so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).

This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.

This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.

Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.

The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a07b2000 05-Nov-2018 Al Viro <viro@zeniv.linux.org.uk>

vfs: syscall: Add open_tree(2) to reference or clone a mount

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within in the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# f1fffbd4 07-Mar-2019 Rasmus Villemoes <linux@rasmusvillemoes.dk>

linux/fs.h: move member alignment check next to definition of struct filename

Instead of doing this compile-time check in some slightly arbitrary user
of struct filename, put it next to the definition.

Link: http://lkml.kernel.org/r/20190208203015.29702-3-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a9519def 05-Mar-2019 Greg Thelen <gthelen@google.com>

writeback: fix inode cgroup switching comment

Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.

Switch the comments to inode_switch_wbs_work_fn().

Link: http://lkml.kernel.org/r/20190305004617.142590-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 84c4e1f8 03-Mar-2019 Linus Torvalds <torvalds@linux-foundation.org>

aio: simplify - and fix - fget/fput for io_submit()

Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:

"In more details - normally IOCB_CMD_POLL handling looks so:

1) io_submit(2) allocates aio_kiocb instance and passes it to

2) aio_poll() resolves the descriptor to struct file by req->file =

3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).

4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.

5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.

6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb

7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.

If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.

However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.

But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:

Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.

Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 091141a4 21-Nov-2018 Jens Axboe <axboe@kernel.dk>

fs: add fget_many() and fput_many()

Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 2b188cc1 07-Jan-2019 Jens Axboe <axboe@kernel.dk>

Add io_uring IO interface

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
Sets up an io_uring instance for doing async IO. On success,
returns a file descriptor that the application can mmap to
gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
Initiates IO against the rings mapped to this fd, or waits for
them to complete, or both. The behavior is controlled by the
parameters passed in. If 'to_submit' is non-zero, then we'll
try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
kernel will wait for 'min_complete' events, if they aren't
already available. It's valid to set IORING_ENTER_GETEVENTS
and 'min_complete' == 0 at the same time, this allows the
kernel to return already completed events without waiting
for them. This is useful only for polling, as for IRQ
driven IO, the application can just check the CQ ring
without entering the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# d911b458 01-Nov-2018 David Howells <dhowells@redhat.com>

vfs: Remove kern_mount_data()

The kern_mount_data() isn't used any more so remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# cb50b348 23-Dec-2018 Al Viro <viro@zeniv.linux.org.uk>

convenience helpers: vfs_get_super() and sget_fc()

the former is an analogue of mount_{single,nodev} for use in
->get_tree() instances, the latter - analogue of sget() for the

These are fairly similar to the originals, but the callback signature
for sget_fc() is different from sget() ones, so getting bits and
pieces shared would be too convoluted; we might get around to that
later, but for now let's just remember to keep them in sync. They
do live next to each other, and changes in either won't be hard
to spot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 3e1aeb00 01-Nov-2018 David Howells <dhowells@redhat.com>

vfs: Implement a filesystem superblock creation/configuration context

[AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living
[AV - reordering fs/namespace.c is badly overdue, but let's keep it
separate from that series]
[AV - drop simple_pin_fs() change]
[AV - clean vfs_kern_mount() failure exits up]

Implement a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.

The mounting procedure then becomes:

(1) Allocate new fs_context context.

(2) Configure the context.

(3) Create superblock.

(4) Query the superblock.

(5) Create a mount for the superblock.

(6) Destroy the context.

Rather than calling fs_type->mount(), an fs_context struct is created and
fs_type->init_fs_context() is called to set it up. Pointers exist for the
filesystem and LSM to hang their private data off.

A set of operations has to be set by ->init_fs_context() to provide
freeing, duplication, option parsing, binary data parsing, validation,
mounting and superblock filling.

Legacy filesystems are supported by the provision of a set of legacy
fs_context operations that build up a list of mount options and then invoke
fs_type->mount() from within the fs_context ->get_tree() operation. This
allows all filesystems to be accessed using fs_context.

It should be noted that, whilst this patch adds a lot of lines of code,
there is quite a bit of duplication with existing code that can be
eliminated should all filesystems be converted over.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# fb7e1600 22-Nov-2018 Christoph Hellwig <hch@lst.de>

fs: add an iopoll method to struct file_operations

This new methods is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# f3a09c92 23-Dec-2018 Al Viro <viro@zeniv.linux.org.uk>

introduce fs_context methods

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 8d0347f6 04-Nov-2018 David Howells <dhowells@redhat.com>

convert do_remount_sb() to fs_context

Replace do_remount_sb() with a function, reconfigure_super(), that's
fs_context aware. The fs_context is expected to be parameterised already
and have ->root pointing to the superblock to be reconfigured.

A legacy wrapper is provided that is intended to be called from the
fs_context ops when those appear, but for now is called directly from
reconfigure_super(). This wrapper invokes the ->remount_fs() superblock op
for the moment. It is intended that the remount_fs() op will be phased

The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
that the context is being used for reconfiguration.

do_umount_root() is provided to consolidate remount-to-R/O for umount and
emergency remount by creating a context and invoking reconfiguration.

do_remount(), do_umount() and do_emergency_remount_callback() are switched
to use the new process.

[AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
umount / bug, gets rid of pointless complexity]
[AV -- set ->net_ns in all cases; nfs remount will need that]
[AV -- shift security_sb_remount() call into reconfigure_super(); the callers
that didn't do security_sb_remount() have NULL fc->security anyway, so it's
a no-op for them]

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# a0c9a8b8 04-Nov-2018 Al Viro <viro@zeniv.linux.org.uk>

teach vfs_get_tree() to handle subtype, switch do_new_mount() to it

Roll the handling of subtypes into do_new_mount() and vfs_get_tree(). The
former determines any subtype string and hangs it off the fs_context; the
latter applies it.

Make do_new_mount() create, parameterise and commit an fs_context and
create a mount for itself rather than calling vfs_kern_mount().

[AV -- missing kstrdup()]
[AV -- ... and no kstrdup() if we get to setting ->s_submount - we
simply transfer it from fc, leaving NULL behind]
[AV -- constify ->s_submount, while we are at it]

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 7d10f70f 30-Jan-2019 Waiman Long <longman@redhat.com>

fs: Don't need to put list_lru into its own cacheline

The list_lru structure is essentially just a pointer to a table of
per-node LRU lists. Even if CONFIG_MEMCG_KMEM is defined, the list
field is just used for LRU list registration and shrinker_id is set at
initialization. Those fields won't need to be touched that often.

So there is no point to make the list_lru structures to sit in their own

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 643fa961 12-Dec-2018 Chandan Rajendra <chandan@linux.vnet.ibm.com>

fscrypt: remove filesystem specific build config option

In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>

# bbe7449e 20-Jan-2019 Phillip Potter <phil@philpotter.co.uk>

fs: common implementation of file type

Many file systems use a copy&paste implementation
of dirent to on-disk file type conversions.

Create a common implementation to be used by file systems
with some useful conversion helpers to reduce open coded
file type conversions in file system code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>

# 89cb0888 28-Dec-2018 Jan Kara <jack@suse.cz>

mm: migrate: provide buffer_migrate_page_norefs()

Provide a variant of buffer_migrate_page() that also checks whether there
are no unexpected references to buffer heads. This function will then be
safe to use for block device pages.

[akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# cb03f94f 29-Nov-2018 NeilBrown <neilb@suse.com>

fs/locks: merge posix_unblock_lock() and locks_delete_block()

posix_unblock_lock() is not specific to posix locks, and behaves
nearly identically to locks_delete_block() - the former returning a
status while the later doesn't.

So discard posix_unblock_lock() and use locks_delete_block() instead,
after giving that function an appropriate return value.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

# ada5c1da 29-Nov-2018 NeilBrown <neilb@suse.com>

fs/locks: rename some lists and pointers.

struct file lock contains an 'fl_next' pointer which
is used to point to the lock that this request is blocked
waiting for. So rename it to fl_blocker.

The fl_blocked list_head in an active lock is the head of a list of
blocked requests. In a request it is a node in that list.
These are two distinct uses, so replace with two list_heads
with different names.
fl_blocked_requests is the head of a list of blocked requests
fl_blocked_member is a node in a member of that list.

The two different list_heads are never used at the same time, but that
will change in a future patch.

Note that a tracepoint is changed to report fl_blocker instead
of fl_next.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

# 20578bdf 19-Nov-2018 Damien Le Moal <damien.lemoal@wdc.com>

block: Initialize BIO I/O priority early

For the synchronous I/O path case (read(), write() etc system calls), a
BIO I/O priority is not initialized until the execution of
blk_init_request_from_bio() when the BIO is submitted and a request
initialized for the BIO execution. This is due to the ki_ioprio field of
the struct kiocb defined on stack being always initialized to
IOPRIO_CLASS_NONE, regardless of the calling process I/O context ioprio
value set with ioprio_set(). This late initialization can result in the
BIO being merged to pending requests even when the I/O priorities

Fix this by initializing the ki_iopriority field of on stack struct
kiocb using the get_current_ioprio() helper, ensuring that all BIOs
allocated and submitted for the system call execution see the correct
intended I/O priority early. With this, since a BIO I/O priority is
always set to the intended effective value for both the sync and async
path, blk_init_request_from_bio() can be simplified.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# c32e5f39 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: hide file range comparison function

There are no callers of vfs_dedupe_file_range_compare, so we might as
well make it a static helper and remove the export.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# eca3654e 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: enable remap callers that can handle short operations

Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it. Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# df365836 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: plumb remap flags through the vfs dedupe functions

Plumb a remap_flags argument through the vfs_dedupe_file_range_one
functions so that dedupe can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 452ce659 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: plumb remap flags through the vfs clone functions

Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 42ec3d4c 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: make remap_file_range functions take and return bytes completed

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 3d28193e 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: pass remap flags to generic_remap_checks

Pass the same remap flags to generic_remap_checks for consistency.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# a91ae49b 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: pass remap flags to generic_remap_file_range_prep

Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 2e5dfc99 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: combine the clone and dedupe into a single remap_file_range

Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# a83ab01a 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: rename vfs_clone_file_prep to be more descriptive

The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only. Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 1383a7ed 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: check file ranges before cloning files

Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location. This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 721fb6fb 17-Oct-2018 Jan Kara <jack@suse.cz>

fsnotify: Fix busy inodes during unmount

Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:

inode = fsnotify_detach_connector_from_object(conn)
sees connector detached for inode
-> nothing to do
barfs on pending inode reference

Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.

Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:

Avg Stddev
Vanilla 9.817400 0.276165
Fixed 9.710467 0.228294

So there's no statistically significant difference.

Fixes: 6b3f05d24d35 ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>

# ff9c745b 22-Nov-2017 Matthew Wilcox <willy@infradead.org>

mm: Convert page-writeback to XArray

Includes moving mapping_tagged() to fs.h as a static inline, and
changing it to return bool.

Signed-off-by: Matthew Wilcox <willy@infradead.org>

# eb797a8e 05-Mar-2018 Matthew Wilcox <willy@infradead.org>

page cache: Rearrange address_space

Change i_pages from a radix_tree_root to an xarray, convert the
documentation into kernel-doc format and change the order of the elements
to pack them better on 64-bit systems.

Signed-off-by: Matthew Wilcox <willy@infradead.org>

# 99c228a9 18-Oct-2018 Amir Goldstein <amir73il@gmail.com>

fs: group frequently accessed fields of struct super_block together

Kernel test robot reported [1] a 6% performance regression in a
concurrent unlink(2) workload on commit 60f7ed8c7c4d ("fsnotify: send
path type events to group with super block marks").

The performance test was run with no fsnotify marks at all on the
data set, so the only extra instructions added by the offending
commit are tests of the super_block fields s_fsnotify_{marks,mask}
and these tests happen on almost every single inode access.

When adding those fields to the super_block struct, we did not give much
thought of placing them on a hot cache lines (we just placed them at the
end of the struct).

Re-organize struct super_block to try and keep some frequently accessed
fields on the same cache line.

Move the frequently accessed fields s_fsnotify_{marks,mask} near the
frequently accessed fields s_fs_info,s_time_gran, while filling a 64bit
alignment hole after s_time_gran.

Move the seldom accessed fields s_id,s_uuid,s_max_links,s_mode near the
seldom accessed fields s_vfs_rename_mutex,s_subtype.

Rong Chen confirmed that this patch solved the reported problem.

[1] https://lkml.org/lkml/2018/9/30/206

Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 1e6cb72399 ("fsnotify: add super block object type")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# a725356b 18-Sep-2018 Amir Goldstein <amir73il@gmail.com>

vfs: swap names of {do,vfs}_clone_file_range()

Commit 031a072a0b8a ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.

The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.

It seems that commit 8ede205541ff ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.

Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 1e6cb723 01-Sep-2018 Amir Goldstein <amir73il@gmail.com>

fsnotify: add super block object type

Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 45cd0faa 27-Aug-2018 Amir Goldstein <amir73il@gmail.com>

vfs: add the fadvise() file operation

This is going to be used by overlayfs and possibly useful
for other filesystems.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 30aba665 23-Aug-2018 Salvatore Mesoraca <s.mesoraca16@gmail.com>

namei: allow restricted O_CREAT of FIFOs and regular files

Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:


This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 74c8164e 17-Aug-2018 Jens Axboe <axboe@kernel.dk>

mpage: mpage_readpages() should submit IO as read-ahead

a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this

Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.

Add some documentation notes on ->readpages() being purely for

Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 4cdfffc8 17-Aug-2018 NeilBrown <neilb@suse.com>

vfs: discard ATTR_ATTR_FLAG

This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.

Let's discard it properly.

Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 5bef9151 29-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

new helper: inode_fake_hash()

open-coded in a quite a few places...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# c2b6d621 28-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

new primitive: discard_new_inode()

We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction. However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.

Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.

iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.

[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# fb16043b 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove open_flags from d_real()

Opening regular files on overlayfs is now handled via ovl_open(). Remove
the now unused "open_flags" argument from d_op->d_real() and the d_real()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# de2a4a50 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

Partially revert "locks: fix file locking on overlayfs"

This partially reverts commit c568d68341be7030f5647def68851e469b21ca11.

Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.

It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode(). We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.

Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace
API for some time (though not in a useful way). Will try to remove
internal flags later when the dust around the new mount API settles.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>

# 4ab30319 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

Revert "vfs: add flags to d_real()"

This reverts commit 495e642939114478a5237a7d91661ba93b76f15a.

No user of "flags" argument of d_real() remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# c6718543 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

Revert "vfs: update ovl inode before relatime check"

This reverts commit 598e3c8f72f5b77c84d2cb26cfd936ffb3cfdbaa.

Overlayfs no longer relies on the vfs correct atime handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# f1825366 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: export vfs_dedupe_file_range_one() to modules

This is needed by the stacked dedupe implementation in overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 9df6702a 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: export vfs_ioctl() to modules

This is needed by the stacked ioctl implementation in overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# d3b1084d 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: make open_with_fake_path() not contribute to nr_files

Stacking file operations in overlay will store an extra open file for each
overlay file opened.

The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.

For fear of breaking working setups, don't start accounting the extra file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 2abc77af 12-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

new helper: open_with_fake_path()

open a file by given inode, faking ->f_path. Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question. Again, this is no general-purpose interface and I hope
it will eventually go away. Right now overlayfs wants something
like that, but nothing else should.

Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# dbae8f2c 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>


no users left

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 44907d79 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

get rid of 'opened' argument of ->atomic_open() - part 3

now it can be done...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# be12af3e 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

getting rid of 'opened' argument of ->atomic_open() - part 1

'opened' argument of finish_open() is unused. Kill it.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>

# 73a09dd9 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

introduce FMODE_CREATED and switch to it

Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...

Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# f5d11409 09-Jul-2018 Al Viro <viro@zeniv.linux.org.uk>

introduce FMODE_OPENED

basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".

NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# b4e7a7a8 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

Failure of ->open() should *not* be followed by fput(). Fixed by
using filp_clone_open(), which gets the cleanups right.

Cc: stable@vger.kernel.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 19f391eb 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

turn filp_clone_open() into inline wrapper for dentry_open()

it's exactly the same thing as
dentry_open(&file->f_path, file->f_flags, file->f_cred)

... and rename it to file_clone_open(), while we are at it.
'filp' naming convention is bogus; sure, it's "file pointer",
but we generally don't do that kind of Hungarian notation.
Some of the instances have too many callers to touch, but this
one has only two, so let's sanitize it while we can...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 69448867 05-Jul-2018 Amir Goldstein <amir73il@gmail.com>

fs: shave 8 bytes off of struct inode

Here is a link to Linus' reply to Jan's concern about making
i_blkbibts byte addressable:

Here is a link to an lkp.org report about potential performance
improvement in some workload, which could(?) be related to packing
i_blkbits closer to i_bytes/i_lock:

Changes since v1:
- Add links to relevant discussions

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 87eb5eb2 06-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: dedupe: rationalize args

Clean up f_op->dedupe_file_range() interface.

1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

# 5740c99e 06-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: dedupe: return int

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# a11e1d43 28-Jun-2018 Linus Torvalds <torvalds@linux-foundation.org>

Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL

The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental

[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# cf85ec88 09-Jun-2018 Vasily Averin <vvs@virtuozzo.com>

removed extra extern file_fdatawait_range

Jeff added this extern twice in commit a823e4589e68

Fixes: a823e4589e68 ("mm: add file_fdatawait_range and file_write_and_wait")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 95582b00 08-May-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: change inode times to use struct timespec64

struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
- struct timespec
+ struct timespec64
current_time ( ... )
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
- return timespec_trunc(
+ return timespec64_trunc(
... );

@ depends on patch @
identifier xtime;
struct \( iattr \| inode \| kstat \) {
- struct timespec xtime;
+ struct timespec64 xtime;

@ depends on patch @
identifier t;
struct inode_operations {
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }

@ depends on patch @
identifier t;
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
- struct timespec ts;
+ struct timespec64 ts;
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);

<+... when != ts
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
ts = current_time(e)
fn_update_time(..., &ts,...)
inode_node->i_xtime = ts
node1->i_xtime = ts
ts = inode_node->i_xtime
<+... attr1->ia_xtime ...+> = ts
ts = attr1->ia_xtime
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
- ts = timespec64_to_timespec(
+ ts =
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
- ts = E3
+ ts = timespec_to_timespec64(E3)
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
- ts
+ timespec64_to_timespec(ts)
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
+ struct timespec ts;
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
+ struct timespec ts;
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
stat->xtime = node2->i_xtime1;
stat1.xtime = node2->i_xtime1;
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
node->i_xtime1 = current_time(...);
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>

# d9a08a9e 22-May-2018 Adam Manzanares <adam.manzanares@wdc.com>

fs: Add aio iopriority support

This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.

This patch depends on block: add ioprio_check_cap function.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# fc28724d 22-May-2018 Adam Manzanares <adam.manzanares@wdc.com>

fs: Convert kiocb rw_hint from enum to u16

In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 80ea09a0 17-May-2018 Miklos Szeredi <miklos@szeredi.hu>

vfs: factor out inode_insert5()

Split out common helper for race free insertion of an already allocated
inode into the cache. Use this from iget5_locked() and
insert_inode_locked4(). Make iget5_locked() use new_inode()/iput() instead
of alloc_inode()/destroy_inode() directly.

Also export to modules for use by filesystems which want to preallocate an
inode before file/directory creation.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 5afb7835 29-May-2018 Christoph Hellwig <hch@lst.de>

block: don't print a message when the device went away

The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 3deb642f 09-Jan-2018 Christoph Hellwig <hch@lst.de>

fs: introduce new ->get_poll_head and ->poll_mask methods

->get_poll_head returns the waitqueue that the poll operation is going
to sleep on. Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events. But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reason left to keep the
multiple waitqueues, and if there are any ->poll is still around, the
driver just won't support aio poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

# 8efd6894 22-Apr-2018 Deepa Dinamani <deepa.kernel@gmail.com>

fs: add timespec64_truncate()

As vfs moves to using struct timespec64 to represent times,
update the argument to timespec_truncate() to use
struct timespec64. Also change the name of the function.
The rest of the implementation logic is the same.

Move this to fs/inode.c instead of kernel/time/time.c as all the
users of this api are filesystems.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <viro@zeniv.linux.org.uk>

# bbbc3fb6 30-Apr-2018 Eric Biggers <ebiggers@google.com>

fs, fscrypt: only define ->s_cop when FS_ENCRYPTION is enabled

Now that filesystems only set and use their fscrypt_operations when they
are built with encryption support, we can remove ->s_cop from
'struct super_block' when FS_ENCRYPTION is disabled. This saves a few
bytes on some kernels and also makes it consistent with ->i_crypt_info.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# 75abe329 17-May-2018 Li Qiang <liq3ea@gmail.com>

fs.h: fix outdated comment about file flags

The __dentry_open function was removed in
commit <2a027e7a18738>("fold __dentry_open() into its sole caller").

Signed-off-by: Li Qiang <liq3ea@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 0c8e3fe3 13-May-2018 Amir Goldstein <amir73il@gmail.com>

vfs: add the sb_start_intwrite_trylock() helper

Needed by ext4 to test frozen fs before updating s_last_mounted.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

# 7a107c0f 05-Apr-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

fasync: Fix deadlock between task-context and interrupt-context kill_fasync()

I observed the following deadlock between them:

[task 1] [task 2] [task 3]
kill_fasync() mm_update_next_owner() copy_process()
spin_lock_irqsave(&fa->fa_lock) read_lock(&tasklist_lock) write_lock_irq(&tasklist_lock)
send_sigio() <IRQ> ...
read_lock(&fown->lock) kill_fasync() ...
read_lock(&tasklist_lock) spin_lock_irqsave(&fa->fa_lock) ...

Task 1 can't acquire read locked tasklist_lock, since there is
already task 3 expressed its wish to take the lock exclusive.
Task 2 holds the read locked lock, but it can't take the spin lock.

Also, there is possible another deadlock (which I haven't observed):

[task 1] [task 2]
f_getown() kill_fasync()
read_lock(&f_own->lock) spin_lock_irqsave(&fa->fa_lock,)
<IRQ> send_sigio() write_lock_irq(&f_own->lock)
kill_fasync() read_lock(&fown->lock)

Actually, we do not need exclusive fa->fa_lock in kill_fasync_rcu(),
as it guarantees fa->fa_file->f_owner integrity only. It may seem,
that it used to give a task a small possibility to receive two sequential
signals, if there are two parallel kill_fasync() callers, and task
handles the first signal fastly, but the behaviour won't become
different, since there is exclusive sighand lock in do_send_sig_info().

The patch converts fa_lock into rwlock_t, and this fixes two above
deadlocks, as rwlock is allowed to be taken from interrupt handler
by qrwlock design.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# b93b0163 10-Apr-2018 Matthew Wilcox <willy@infradead.org>

page cache: use xa_lock

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a09acf4b 09-Apr-2018 David Howells <dhowells@redhat.com>

vfs: Remove the const from dir_context::actor

Remove the const marking from the actor function pointer in the dir_context
struct. The const prevents the structure from being used as part of a
kmalloc'd object as it makes the compiler require that the actor member be
set at object initialisation time (or not at all), incuring something like
the following error if you try and set it later:

fs/afs/dir.c:556:20: error: assignment of read-only member 'actor'

Marking the member const like this adds very little in the way of sanity
checking as the type checking system is likely to provide sufficient - and
if not, the kernel is very likely to oops repeatably in this case.

Fixes: ac6614b76478 ("[readdir] constify ->actor")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>

# f44c7763 07-Mar-2018 Dan Williams <dan.j.williams@intel.com>

fs, dax: prepare for dax-specific address_space_operations

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# 0e11f644 21-Feb-2018 Christoph Hellwig <hch@lst.de>

fs: move I_DIRTY_INODE to fs.h

And use it in a few more places rather than opencoding the values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 57b56ac6 21-Feb-2018 Mimi Zohar <zohar@linux.vnet.ibm.com>

ima: fail file signature verification on non-init mounted filesystems

FUSE can be mounted by unprivileged users either today with fusermount
installed with setuid, or soon with the upcoming patches to allow FUSE
mounts in a non-init user namespace.

This patch addresses the new unprivileged non-init mounted filesystems,
which are untrusted, by failing the signature verification.

This patch defines two new flags SB_I_IMA_UNVERIFIABLE_SIGNATURE and

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Cc: Alban Crequy <alban@kinvolk.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 08fdc8a0 03-Oct-2017 Mateusz Guzik <mguzik@redhat.com>

buffer.c: call thaw_super during emergency thaw

There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thaw of only one of these does not thaw the other.

In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained

The patch is a hack which adds blind unfreezing.

In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1c949843 28-Feb-2018 Rasmus Villemoes <linux@rasmusvillemoes.dk>

vfs: make sure struct filename->iname is word-aligned

I noticed that offsetof(struct filename, iname) is actually 28 on 64
bit platforms, so we always pass an unaligned pointer to
strncpy_from_user. This is mostly a problem for those 64 bit platforms
without HAVE_EFFICIENT_UNALIGNED_ACCESS, but even on x86_64, unaligned
accesses carry a penalty.

A user-space microbenchmark doing nothing but strncpy_from_user from the
same (aligned) source string runs about 5% faster when the destination
is aligned. That number increases to 20% when the string is long
enough (~32 bytes) that we cross a cache line boundary - that's for
example the case for about half the files a "git status" in a kernel
tree ends up stat'ing.

This won't make any real-life workloads 5%, or even 1%, faster, but path
lookup is common enough that cutting even a few cycles should be
worthwhile. So ensure we always pass an aligned destination pointer to
strncpy_from_user. Instead of explicit padding, simply swap the refcnt
and aname members, as suggested by Al Viro.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 95dd7758 14-Mar-2018 Eric W. Biederman <ebiederm@xmission.com>

fs: Teach path_connected to handle nfs filesystems with multiple roots.

On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem. The subsets can be from
disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.

The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem. The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.

When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount. This move can happen either locally or
remotely. With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check. As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.

The is_subdir function might be cheap enough that we can call it
unconditionally. Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to. So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar. But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ce3077ee 23-Feb-2018 Nikolay Borisov <nborisov@suse.com>

direct-io: Remove unused DIO_SKIP_DIO_COUNT logic

This flag was added by fe0f07d08ee3 ("direct-io: only inc/deci
inode->i_dio_count for file systems") as means to optimise the atomic
modificaiton of the variable for blockdevices. However with the advent
of 542ff7bf18c6 ("block: new direct I/O implementation") it became
unused. So let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# c8f4c36f 23-Feb-2018 Nikolay Borisov <nborisov@suse.com>

direct-io: Remove unused DIO_ASYNC_EXTEND flag

This flag was added by 6039257378e4 ("direct-io: add flag to allow aio
writes beyond i_size") to support XFS. However, with the rework of
XFS' DIO's path to use iomap in acdda3aae146 ("xfs: use iomap_dio_rw")
it became redundant. So let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 230f5a89 21-Feb-2018 Dan Williams <dan.j.williams@intel.com>

dax: fix vma_is_fsdax() helper

Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use
S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on
device-dax instances when those are meant to be explicitly allowed.

Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
Cc: <stable@vger.kernel.org>
Reported-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Jane Chu <jane.chu@oracle.com>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# 01c2e13d 18-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: only grab shared inode locks for source file during reflink

Reflink and dedupe operations remap blocks from a source file into a
destination file. The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# f02a9ad1 21-Dec-2017 Jeff Layton <jlayton@kernel.org>

fs: handle inode->i_version more efficiently

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>

# ae5e165d 29-Jan-2018 Jeff Layton <jlayton@kernel.org>

fs: new API for handling inode->i_version

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>

# 4bfd054a 16-Jan-2018 Eric Biggers <ebiggers@google.com>

fs: fold __inode_permission() into inode_permission()

Since commit 9c630ebefeee ("ovl: simplify permission checking"),
overlayfs doesn't call __inode_permission() anymore, which leaves no
users other than inode_permission(). So just fold it back into

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# e1fc742e 29-Sep-2017 Jürg Billeter <j@bitron.ch>

fs: add RWF_APPEND

This is the per-I/O equivalent of O_APPEND to support atomic append
operations on any open file.

If a file is opened with O_APPEND, pwrite() ignores the offset and
always appends data to the end of the file. RWF_APPEND enables atomic
append and pwrite() with offset on a single file descriptor.

Signed-off-by: Jürg Billeter <j@bitron.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 5a9d929d 08-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

iomap: report collisions between directio and buffered writes to userspace

If two programs simultaneously try to write to the same part of a file
via direct IO and buffered IO, there's a chance that the post-diowrite
pagecache invalidation will fail on the dirty page. When this happens,
the dio write succeeded, which means that the page cache is no longer
coherent with the disk!

Programs are not supposed to mix IO types and this is a clear case of
data corruption, so store an EIO which will be reflected to userspace
during the next fsync. Replace the WARN_ON with a ratelimited pr_crit
so that the developers have /some/ kind of breadcrumb to track down the
offending program(s) and file(s) involved.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

# 8e6c848e 01-Dec-2017 Al Viro <viro@zeniv.linux.org.uk>

new primitive: vfs_mkobj()

Similar to vfs_create(), but with caller-supplied callback (and
argument for it) to be used instead of ->create().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# f1ee6162 20-Dec-2017 NeilBrown <neilb@suse.com>

VFS: don't keep disconnected dentries on d_anon

The original purpose of the per-superblock d_anon list was to
keep disconnected dentries in the cache between consecutive
requests to the NFS server. Dentries can be disconnected if
a client holds a file open and repeatedly performs IO on it,
and if the server drops the dentry, whether due to memory
pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".

This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked
file indefinitely delayed") which caused disconnected dentries
to be freed as soon as their refcount reached zero.

This means that, when a dentry being used by nfsd gets disconnected, a
new one needs to be allocated for every request (unless requests
overlap). As the dentry has no name, no parent, and no children,
there is little of value to cache. As small memory allocations are
typically fast (from per-cpu free lists) this likely has little cost.

This means that the original purpose of s_anon is no longer relevant:
there is no longer any need to keep disconnected dentries on a list so
they appear to be hashed.

However, s_anon now has a new use. When you mount an NFS filesystem,
the dentry stored in s_root is just a placebo. The "real" root dentry
is allocated using d_obtain_root() and so it kept on the s_anon list.
I don't know the reason for this, but suspect it related to NFSv4
where a mount of "server:/some/path" require NFS to look up the root
filehandle on the server, then walk down "/some" and "/path" to get
the filehandle to mount.

Whatever the reason, NFS depends on the s_anon list and on
shrink_dcache_for_umount() pruning all dentries on this list. So we
cannot simply remove s_anon.

We could just leave the code unchanged, but apart from that being
potentially confusing, the (unfair) bit-spin-lock which protects
s_anon can become a bottle neck when lots of disconnected dentries are
being created.

So this patch renames s_anon to s_roots, and stops storing
disconnected dentries on the list. Only dentries obtained with
d_obtain_root() are now stored on this list. There are many fewer of
these (only NFS and NILFS2 use the call, and only during filesystem
mount) so contention on the bit-lock will not be a problem.

Possibly an alternate solution should be found for NFS and NILFS2, but
that would require understanding their needs first.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 5d38f049 29-Nov-2017 Ian Kent <raven@themaw.net>

autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"

Commit 42f461482178 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
flag but introduced a semantic change.

In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
negative dentry case for stat family system calls in follow_automount().

This changed the unconditional triggering of an automount in this case
to no longer be done and an error returned instead.

This has caused more problems than I expected so reverting the change is

In a discussion with Neil Brown it was concluded that the automount(8)
daemon can implement this change without kernel modifications. So that
will be done instead and the autofs module documentation updated with a
description of the problem and what needs to be done by module users for
this specific case.

Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net
Fixes: 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Neil Brown <neilb@suse.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: <stable@vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 2bb6d283 29-Nov-2017 Dan Williams <dan.j.williams@intel.com>

mm: introduce get_user_pages_longterm

Patch series "introduce get_user_pages_longterm()", v2.

Here is a new get_user_pages api for cases where a driver intends to
keep an elevated page count indefinitely. This is distinct from usages
like iov_iter_get_pages where the elevated page counts are transient.
The iov_iter_get_pages cases immediately turn around and submit the
pages to a device driver which will put_page when the i/o operation
completes (under kernel control).

In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime
of the block / page and needs reasonable limits on how long it can wait
for pages in a mapping to become idle.

Fixing filesystems to actually wait for dax pages to be idle before
blocks from a truncate/hole-punch operation are repurposed is saved for
a later patch series.

Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.

I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings
were supported by the kernel. The behavior regression this policy
change implies is one of the reasons we maintain the "dax enabled.
Warning: EXPERIMENTAL, use at your own risk" notification when mounting
a filesystem in dax mode.

It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.

This patch (of 4):

Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas. Device-dax vmas do not have this problem and are
explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and

[akpm@linux-foundation.org: use kcalloc()]
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a3f8683b 02-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

->poll() methods should return __poll_t

The most common place to find POLL... bitmaps: return values
of ->poll() and its subsystem counterparts.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1751e8a6 27-Nov-2017 Linus Torvalds <torvalds@linux-foundation.org>

Rename superblock flags (MS_xyz -> SB_xyz)

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants

for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a7d3d039 10-Sep-2017 Christoph Hellwig <hch@lst.de>

integrity: use kernel_read_file_from_path() to read x509 certs

The CONFIG_IMA_LOAD_X509 and CONFIG_EVM_LOAD_X509 options permit
loading x509 signed certificates onto the trusted keyrings without
verifying the x509 certificate file's signature.

This patch replaces the call to the integrity_read_file() specific
function with the common kernel_read_file_from_path() function.
To avoid verifying the file signature, this patch defines

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>

# 63dcb81e 02-Oct-2017 Mike Rapoport <rppt@linux.vnet.ibm.com>

include/linux/fs.h: fix comment about struct address_space

Before commit 9c5d760b8d22 ("mm: split gfp_mask and mapping flags into
separate fields") the private_* fields of struct adrress_space were grouped
together and using "ditto" in comments describing the last fields was

With introduction of gpf_mask between private_lock and private_list "ditto"
references the wrong description.

Fix it by using the elaborate description.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1c972597 01-Nov-2017 Dan Williams <dan.j.williams@intel.com>

mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:

I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do

| MAP_SYNC, fd, 0);

and "know" that MAP_SYNC actually takes.

And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.

Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do

| MAP_SYNC, fd, 0);

and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.

Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and

Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# 2ee6a576 09-Oct-2017 Eric Biggers <ebiggers@google.com>

fs, fscrypt: add an S_ENCRYPTED inode flag

Introduce a flag S_ENCRYPTED which can be set in ->i_flags to indicate
that the inode is encrypted using the fscrypt (fs/crypto/) mechanism.

Checking this flag will give the same information that
inode->i_sb->s_cop->is_encrypted(inode) currently does, but will be more
efficient. This will be useful for adding higher-level helper functions
for filesystems to use. For example we'll be able to replace this:

if (ext4_encrypted_inode(inode)) {
ret = fscrypt_get_encryption_info(inode);
if (ret)
return ret;
if (!fscrypt_has_encryption_key(inode))
return -ENOKEY;

with this:

ret = fscrypt_require_key(inode);
if (ret)
return ret;

... since we'll be able to retain the fast path for unencrypted files as
a single flag check, using an inline function. This wasn't possible
before because we'd have had to frequently call through the
->i_sb->s_cop->is_encrypted function pointer, even when the encryption
support was disabled or not being used.

Note: we don't define S_ENCRYPTED to 0 if CONFIG_FS_ENCRYPTION is
disabled because we want to continue to return an error if an encrypted
file is accessed without encryption support, rather than pretending that
it is unencrypted.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

# f175f307 14-Oct-2017 Al Viro <viro@zeniv.linux.org.uk>

stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 53fd88ab6 14-Oct-2017 Al Viro <viro@zeniv.linux.org.uk>

make vfs_ustat() static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 32e57c29 03-Oct-2017 Mike Rapoport <rppt@linux.vnet.ibm.com>

include/linux/fs.h: fix comment about struct address_space

Before commit 9c5d760b8d22 ("mm: split gfp_mask and mapping flags into
separate fields") the private_* fields of struct adrress_space were
grouped together and using "ditto" in comments describing the last
fields was correct.

With introduction of gpf_mask between private_lock and private_list
"ditto" references the wrong description.

Fix it by using the elaborate description.

Link: http://lkml.kernel.org/r/1507009987-8746-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 003278e4 26-Sep-2017 Corentin Labbe <clabbe.montjoie@gmail.com>

nfs_common: convert int to bool

Since __state_in_grace return only true/false, make it return bool
instead of int.
Same change for the two user of it, locks_in_grace/opens_in_grace

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

# 711aab1d 12-Sep-2017 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: constify path argument to kernel_read_file_from_path

This patch constifies the path argument to kernel_read_file_from_path().

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 42f46148 08-Sep-2017 Ian Kent <raven@themaw.net>

autofs: fix AT_NO_AUTOMOUNT not being honored

The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call. But this flag is unconditionally cleared for all
stat family system calls except statx().

stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.

In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are

AFAICT this change doesn't have any noticable side effects and may, in
some use cases (although I didn't see it in testing) prevent unnecessary
callbacks to the automount daemon.

It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process. But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.

This is the same semantic as the positive dentry case already handled.

Link: http://lkml.kernel.org/r/150216641255.11652.4204561328197919771.stgit@pluto.themaw.net
Fixes: deccf497d804a4c5fca ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# f808c13f 08-Sep-2017 Davidlohr Bueso <dave@stgolabs.net>

lib/interval_tree: fast overlap detection

Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().

As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().

[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a446d6f9 06-Sep-2017 Jeff Layton <jlayton@kernel.org>

include/linux/fs.h: remove unneeded forward definition of mm_struct

Link: http://lkml.kernel.org/r/20170525102927.6163-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 9725d4ce 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: unexport vfs_readv and vfs_writev

We've got no modular users left, and any potential modular user is better
of with iov_iter based variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# eb031849 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: unexport __vfs_read/__vfs_write

No modular users left, and any new ones should use kernel_read/write
or iov_iter variants instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 73e18f7c 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: make the buf argument to __kernel_write a void pointer

This matches kernel_read and kernel_write and avoids any need for casts in
the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# e13ec939 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: fix kernel_write prototype

Make the position an in/out argument like all the other read/write
helpers and and make the buf argument a void pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# bdd1d2d3 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: fix kernel_read prototype

Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 91f9943e 29-Aug-2017 Christoph Hellwig <hch@lst.de>

fs: support RWF_NOWAIT for buffered reads

This is based on the old idea and code from Milosz Tanski. With the aio
nowait code it becomes mostly trivial now. Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 495e6429 04-Sep-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: add flags to d_real()

Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 799ea9e9 18-Aug-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: evict all inodes involved with log redo item

When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery. This also had the
effect of putting linked inodes on the lru instead of evicting them.

Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.

Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.

Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>

# ddef7ed2 06-Jul-2017 Christoph Hellwig <hch@lst.de>

annotate RWF_... flags

[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 0cc3b0ec 27-Aug-2017 Linus Torvalds <torvalds@linux-foundation.org>

Clarify (and fix) MAX_LFS_FILESIZE macros

We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
filesystems (and other IO targets) that know they are 64-bit clean and
don't have any 32-bit limits in their IO path.

It turns out that our 32-bit value for that limit was bogus. On 32-bit,
the VM layer is limited by the page cache to only 32-bit index values,
but our logic for that was confusing and actually wrong. We used to
define that value to

(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

which is actually odd in several ways: it limits the index to 31 bits,
and then it limits files so that they can't have data in that last byte
of a page that has the highest 31-bit index (ie page index 0x7fffffff).

Neither of those limitations make sense. The index is actually the full
32 bit unsigned value, and we can use that whole full page. So the
maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".

However, we do wan tto avoid the maximum index, because we have code
that iterates over the page indexes, and we don't want that code to
overflow. So the maximum size of a file on a 32-bit host should
actually be one page less than the full 32-bit index.

So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will
not actually be using the page of that last index (ULONG_MAX), but we
can grow a file up to that limit.

The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
byte less), but the actual true VM limit is one page less than 16TiB.

This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop
in truncate_inode_pages_range()"), which started applying that
MAX_LFS_FILESIZE limit to block devices too.

NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
actually just the offset type itself (loff_t), which is signed. But for
clarity, on 64-bit, just use the maximum signed value, and don't make
people have to count the number of 'f' characters in the hex constant.

So just use LLONG_MAX for the 64-bit case. That was what the value had
been before too, just written out as a hex constant.

Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c2ee070f 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: cache the partition index in struct block_device

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# f4a8116a 08-Aug-2017 Jan Kara <jack@suse.cz>

fs: Provide __inode_get_bytes()

Provide helper __inode_get_bytes() which assumes i_lock is already
acquired. Quota code will need this to be able to use i_lock to protect
consistency of quota accounting information and inode usage.

Signed-off-by: Jan Kara <jack@suse.cz>

# ffb959bb 31-Jul-2017 Jeff Layton <jlayton@kernel.org>

mm: remove optimizations based on i_size in mapping writeback waits

Marcelo added this i_size based optimization with a patch in 2004
(commitid is from the linux-history tree):

commit 765dad09b4ac101a32d87af2bb793c3060497d3c
Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Date: Tue Sep 7 17:51:17 2004 -0700

small wait_on_page_writeback_range() optimization

filemap_fdatawait() calls wait_on_page_writeback_range() with -1
as "end" parameter. This is not needed since we know the EOF
from the inode. Use that instead.

There may be races here, particularly with clustered or network
filesystems. It also seems like a bit of a layering violation since
we're operating on an address_space here, not an inode.

Finally, it's also questionable whether this optimization really helps
on workloads that we care about. Should we be optimizing for writeback
vs. truncate races in a codepath where we expect to wait anyway? It
doesn't seem worth the risk.

Remove this optimization from the filemap_fdatawait codepaths. This
means that filemap_fdatawait becomes a trivial wrapper around

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# a823e458 28-Jul-2017 Jeff Layton <jlayton@kernel.org>

mm: add file_fdatawait_range and file_write_and_wait

Necessary now for gfs2_fsync and sync_file_range, but there will
eventually be other callers.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# 80aafd50 24-Jul-2017 Jeff Layton <jlayton@kernel.org>

Documentation: add some docs for errseq_t

...and fix up a few comments in the code.

Signed-off-by: Jeff Layton <jlayton@redhat.com>

# 133d55cd 16-Jun-2017 Logan Gunthorpe <logang@deltatee.com>

block: order /proc/devices by major number

Presently, the order of the block devices listed in /proc/devices is not
entirely sequential. If a block device has a major number greater than
BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
limit (chosen to be 512). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# 8a932f73 15-Jun-2017 Logan Gunthorpe <logang@deltatee.com>

char_dev: order /proc/devices by major number

Presently, the order of the char devices listed in /proc/devices is not
entirely sequential. If a char device has a major number greater than
CHRDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce CHRDEV_MAJOR_MAX as an artificial
limit (chosen to be 511). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# a5d31a3f 15-Jun-2017 Logan Gunthorpe <logang@deltatee.com>

char_dev: extend dynamic allocation of majors into a higher range

We've run into problems with running out of dynamicly assign char
device majors particullarly on automated test systems with
all-yes-configs. Roughly 40 dynamic assignments can be made with such
kernels at this time while space is reserved for only 20.

Currently, the kernel only prints a warning when dynamic allocation
overflows the reserved region. And when this happens drivers that have
fixed assignments can randomly fail depending on the order of
initialization of other drivers. Thus, adding a new char device can cause
unexpected failures in completely unrelated parts of the kernel.

This patch solves the problem by extending dynamic major number
allocations down from 511 once the 234-254 region fills up. Fixed
majors already exist above 255 so the infrastructure to support
high number majors is already in place. The patch reserves an
additional 128 major numbers which should hopefully last us a while.

Kernels that don't require more than 20 dynamic majors assigned (which
is pretty typical) should not be affected by this change.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: https://lkml.org/lkml/2017/6/4/107
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# e462ec50 17-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Differentiate mount flags (MS_*) from internal superblock flags

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags. s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

(1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
MNT_NODEV) without passing this on to the filesystem, but some
filesystems set such flags anyway.

(2) The ->remount_fs() methods of some filesystems adjust the *flags
argument by setting MS_* flags in it, such as MS_NOATIME - but these
flags are then scrubbed by do_remount_sb() (only the occupants of

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>

# 94e92e7a 17-Jul-2017 David Howells <dhowells@redhat.com>

vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

Add an sb_rdonly() function to query the MS_RDONLY flag on sb->s_flags
preparatory to providing an SB_RDONLY flag.

Signed-off-by: David Howells <dhowells@redhat.com>

# 9d5b86ac 16-Jul-2017 Benjamin Coddington <bcodding@redhat.com>

fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks

Since commit c69899a17ca4 "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context. The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.

The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK. There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid. So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.

The translaton and presentation of fl_pid should handle the following four

1 - F_GETLK on a remote file with a remote lock:
In this case, the filesystem should determine the l_pid to return here.
Filesystems should indicate that the fl_pid represents a non-local pid
value that should not be translated by returning an fl_pid <= 0.

2 - F_GETLK on a local file with a remote lock:
This should be the l_pid of the lock manager process, and translated.

3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
These should be the translated l_pid of the local locking process.

Fuse was already doing the correct thing by translating the pid into the
caller's namespace. With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.

With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0. This signifies that the pid is remote, and the locks API will
forego translating that pid into the pid namespace of the local calling

Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.

Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million), but pid_t is an unsigned int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.

If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid. This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# 1d278a87 05-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Kill off s_options and helpers

Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves. This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# baabda26 10-Jul-2017 Dan Williams <dan.j.williams@intel.com>

mm: always enable thp for dax mappings

The madvise policy for transparent huge pages is meant to avoid unwanted
allocations of transparent huge pages. It allows a policy of disabling
the extra memory pressure and effort to arrange for a huge page when it
is not needed.

DAX by definition never incurs this overhead since it is statically
allocated. The policy choice makes even less sense for device-dax which
tries to guarantee a given tlb-fault size. Specifically, the following

echo never > /sys/kernel/mm/transparent_hugepage/enabled

...violates that guarantee and silently disables all device-dax
instances with a 2M or 1G alignment. So, let's avoid that non-obvious
side effect by force enabling thp for dax mappings in all cases.

It is worth noting that the reason this uses vma_is_dax(), and the
resulting header include changes, is that previous attempts to add a
VM_DAX flag were NAKd.

Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 5660e13d 06-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>

# 76341cab 06-Jul-2017 Jeff Layton <jlayton@kernel.org>

jbd2: don't clear and reset errors after waiting on writeback

Resetting this flag is almost certainly racy, and will be problematic
with some coming changes.

Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
Have jbd2 call it instead of filemap_fdatawait and don't attempt to
re-set the error flag if it fails.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# ee416bcd 04-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Make get_filesystem() return the affected filesystem

Make get_filesystem() return a pointer to the filesystem on which it just
got a ref.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 0f41074a 05-Jul-2017 Jeff Layton <jlayton@kernel.org>

fs: remove call_fsync helper function

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# ad0af710 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

vfs: introduce inode 'inuse' lock

Added an i_state flag I_INUSE and helpers to set/clear/test the bit.

The 'inuse' lock is an 'advisory' inode lock, that can be used to extend
exclusive create protection beyond parent->i_mutex lock among cooperating

This is going to be used by overlayfs to get exclusive ownership on upper
and work dirs among overlayfs mounts.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 3859a271 28-Oct-2016 Kees Cook <keescook@chromium.org>

randstruct: Mark various structs for randomization

This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>

# abbb6589 27-May-2017 Christoph Hellwig <hch@lst.de>

fs: implement vfs_iter_write using do_iter_write

De-dupliate some code and allow for passing the flags argument to
vfs_iter_write. Additionally it now properly updates timestamps.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 18e9710e 27-May-2017 Christoph Hellwig <hch@lst.de>

fs: implement vfs_iter_read using do_iter_read

De-dupliate some code and allow for passing the flags argument to
vfs_iter_read. Additional it properly updates atime now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# c75b1d94 27-Jun-2017 Jens Axboe <axboe@kernel.dk>

fs: add fcntl() interface for setting/getting write life time hints

Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for
setting them as well:

F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.

F_SET_RW_HINT Set one of the above write hints on the
underlying inode.

F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.

F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.

The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.

This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.

* writehint.c: get or set an inode write hint
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>

#ifndef F_GET_RW_HINT

static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",

int main(int argc, char *argv[])
uint64_t hint;
int fd, ret;

if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;

fd = open(argv[1], O_RDONLY);
if (fd < 0) {
return 2;

if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;

ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;

printf("%s: hint %s\n", argv[1], str[hint]);
return 0;

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# b745fafa 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>


RWF_NOWAIT informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable
of returning -EAGAIN if the AIO call will block. This must be set by
supporting filesystems in the ->open() call.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 7fc9e472 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: Introduce filemap_range_has_page()

filemap_range_has_page() return true if the file's mapping has
a page within the range mentioned. This function will be used
to check if a write() call will cause a writeback of previous

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# fdd2f5b7 20-Jun-2017 Goldwyn Rodrigues <rgoldwyn@suse.com>

fs: Separate out kiocb flags setup based on RWF_* flags

Also added RWF_SUPPORTED to encompass all flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

# 5dd43ce2 19-Jun-2017 Ingo Molnar <mingo@kernel.org>

sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>

The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

include/linux/wait_bit.h for types and APIs
kernel/sched/wait_bit.c for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

# 393cc3f5 13-Jun-2017 Jiri Slaby <jirislaby@kernel.org>

fs/fcntl: f_setown, allow returning error

Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# 4055351c 03-Jun-2017 Christoph Hellwig <hch@lst.de>

fs: remove the unused error argument to dio_end_io()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 85787090 10-May-2017 Christoph Hellwig <hch@lst.de>

fs: switch ->s_uuid to uuid_t

For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

# a75d30c7 27-May-2017 Christoph Hellwig <hch@lst.de>

fs/locks: pass kernel struct flock to fcntl_getlk/setlk

This will make it easier to implement a sane compat fcntl syscall.

[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>

# 572e0ca9 12-May-2017 Deepa Dinamani <deepa.kernel@gmail.com>

time: delete current_fs_time()

All uses of the current_fs_time() function have been replaced by other
time interfaces.

And, its use cases can be fulfilled by current_time() or ktime_get_*

Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c718a975 08-May-2017 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag

Commit afddba49d18f ("fs: introduce write_begin, write_end, and
perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
checked in pagecache_write_begin(), but that check was removed by
4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
new aops.") added a check in cifs_write_begin(), but that check was soon
removed by commit a98ee8c1c707 ("[CIFS] fix regression in

Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
remove this flag. This patch has no functionality changes.

Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# deccf497 04-May-2017 David Howells <dhowells@redhat.com>

Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()

stat/lstat/fstatat need to pass AT_NO_AUTOMOUNT to vfs_statx() as the
pre-statx code didn't set LOOKUP_AUTOMOUNT, even though fstatat() accepted

Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available")
Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# cda37124 25-Mar-2017 Eric Biggers <ebiggers@google.com>

fs: constify tree_descr arrays passed to simple_fill_super()

simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory. Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# a0c111b4 08-Apr-2017 Geliang Tang <geliangtang@gmail.com>

fs: drop duplicate header percpu-rwsem.h

Drop duplicate header percpu-rwsem.h from linux/fs.h.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 50f2112c 10-Apr-2017 Benjamin Coddington <bcodding@redhat.com>

locks: Set FL_CLOSE when removing flock locks on close()

Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

# c1844d53 11-Apr-2017 Jan Kara <jack@suse.cz>

fs: Remove SB_I_DYNBDI flag

Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

# fca39346 11-Apr-2017 Jan Kara <jack@suse.cz>

fs: Provide infrastructure for dynamic BDIs in filesystems

Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-mtd@lists.infradead.org
CC: linux-nfs@vger.kernel.org
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-nilfs@vger.kernel.org
CC: cluster-devel@redhat.com
CC: osd-dev@open-osd.org
CC: codalist@coda.cs.cmu.edu
CC: linux-afs@lists.infradead.org
CC: ecryptfs@vger.kernel.org
CC: linux-cifs@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-developer@lists.sourceforge.net
CC: lustre-devel@lists.lustre.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 08991e83 01-Feb-2017 Jan Kara <jack@suse.cz>

fsnotify: Free fsnotify_mark_connector when there is no mark attached

Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# 9dd813c1 13-Mar-2017 Jan Kara <jack@suse.cz>

fsnotify: Move mark list head from object into dedicated structure

Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>

# cbfd0c10 09-Mar-2017 Arnd Bergmann <arnd@arndb.de>

include/linux/fs.h: fix unsigned enum warning with gcc-4.2

With arm-linux-gcc-4.2, almost every file we build in the kernel ends up
with this warning:

include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is always false

Later versions don't have this problem, but it's easy enough to work

Link: http://lkml.kernel.org/r/20161216105634.235457-12-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.


The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.


The new system call is:

int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in


The following structures are defined in which to return the main attribute

struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;

struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];

The defined bits in request_mask and stx_mask are:

STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:


[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.


The following test program can be used to test the statx system call:


Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 93407472 27-Feb-2017 Fabian Frederick <fabf@skynet.be>

fs: add i_blocksize()

Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0eb8af49 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helper for calling f_op->fsync()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# f74ac015 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

mm: use helper for calling f_op->mmap()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# bb7462b6 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helpers for calling f_op->{read,write}_iter()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# bfe219d3 31-Jan-2017 Amir Goldstein <amir73il@gmail.com>

vfs: wrap write f_ops with file_{start,end}_write()

Before calling write f_ops, call file_start_write() instead
of sb_start_write().

Replace {sb,file}_start_write() for {copy,clone}_file_range() and
for fallocate().

Beyond correct semantics, this avoids freeze protection to sb when
operating on special inodes, such as fallocate() on a blockdev.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# af7bd4dc 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

vfs: create vfs helper vfs_tmpfile()

Factor out some common vfs bits from do_tmpfile()
to be used by overlayfs for concurrent copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# b1d2dc56 02-Feb-2017 Jan Kara <jack@suse.cz>

block: Make blk_get_backing_dev_info() safe without open bdev

Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Jens Axboe <axboe@fb.com>

# f44f1ab5 02-Feb-2017 Jan Kara <jack@suse.cz>

block: Unhash block device inodes on gendisk destruction

Currently, block device inodes stay around after corresponding gendisk
hash died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.

Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 031a072a 23-Sep-2016 Amir Goldstein <amir73il@gmail.com>

vfs: call vfs_clone_file_range() under freeze protection

Move sb_start_write()/sb_end_write() out of the vfs helper and up into the
ioctl handler.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 876bec6f 09-Dec-2016 Darrick J. Wong <darrick.wong@oracle.com>

vfs: refactor clone/dedupe_file_range common functions

Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

# d16744ec 09-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: make generic_readlink() static

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 76fca90e 09-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: default to generic_readlink()

If i_op->readlink is NULL, but i_op->get_link is set then vfs_readlink()
defaults to calling generic_readlink().

The IOP_DEFAULT_READLINK flag indicates that the above conditions are met
and the default action can be taken.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# fd4a0edf 09-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: replace calling i_op->readlink with vfs_readlink()

Also check d_is_symlink() in callers instead of inode->i_op->readlink
because following patches will allow NULL ->readlink for symlinks.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# f0bb5aaf 20-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

vfs: misc struct path constification

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ca71cf71 20-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

namespace.c: constify struct path passed to a bunch of primitives

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 640eb7e7 14-Nov-2016 Mickaël Salaün <mic@digikod.net>

fs: Constify path_is_under()'s arguments

The function path_is_under() doesn't modify the paths pointed by its
arguments but only browse them. Constifying this pointers make a cleaner
interface to be used by (future) code which may only have access to
const struct path pointers (e.g. LSM hooks).

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# ba6379f7 22-Nov-2016 Jan Kara <jack@suse.cz>

fs: Provide function to get superblock with exclusive s_umount

Quota code will need a variant of get_super_thawed() that returns
superblock with s_umount held in exclusive mode to serialize quota on
and quota off operations. Provide this functionality.

Signed-off-by: Jan Kara <jack@suse.cz>

# 2f8b5444 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block,fs: untangle fs.h and blk_types.h

Nothing in fs.h should require blk_types.h to be included.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 1e3914d4 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block, fs: move submit_bio to bio.h

This is where all the other bio operations live, so users must include
bio.h anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# d3849953 01-Nov-2016 Christoph Hellwig <hch@lst.de>

fs: decouple READ and WRITE from the block layer ops

Move READ and WRITE to kernel.h and don't define them in terms of block
layer ops; they are our generic data direction indicators these days
and have no more resemblance with the block layer ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 70fd7614 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block,fs: use REQ_* flags directly

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# a2b80967 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block: replace REQ_NOIDLE with REQ_IDLE

Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it. In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.

This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# b685d3d6 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block: treat REQ_FUA and REQ_PREFLUSH as synchronous

Instead of requiring everyone to specify the REQ_SYNC flag aѕ well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 6f6b2917 01-Nov-2016 Christoph Hellwig <hch@lst.de>

block: don't use REQ_SYNC in the READ_SYNC definition

Reads are synchronous per definition, don't add another flag for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 70fe2f48 30-Oct-2016 Jan Kara <jack@suse.cz>

aio: fix freeze protection of aio writes

Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
--runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: forward ported on top of various VFS and aio changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 723c0384 30-Oct-2016 Christoph Hellwig <hch@lst.de>

fs: remove the never implemented aio_fsync file operation

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 87374179 20-Oct-2016 Christoph Hellwig <hch@lst.de>

block: add a proper block layer data direction encoding

Currently the block layer op_is_write, bio_data_dir and rq_data_dir
helper treat every operation that is not a READ as a data out operation.
This worked surprisingly long, but the new REQ_OP_ZONE_REPORT operation
actually adds a second operation that reads data from the device.
Surprisingly nothing critical relied on this direction, but this might
be a good opportunity to properly fix this issue up.

We take a little inspiration and use the least significant bit of the
operation number to encode the data direction, which just requires us
to renumber the operations to fix this scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# d60874cd 04-Oct-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: add vfs_get_link() helper

This helper is for filesystems that want to read the symlink and are better
off with the get_link() interface (returning a char *) rather than the
readlink() interface (copy into a userspace buffer).

Also call the LSM hook for readlink (not get_link) since this is for
symlink reading not following.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 9c5d760b 11-Oct-2016 Michal Hocko <mhocko@suse.com>

mm: split gfp_mask and mapping flags into separate fields

mapping->flags currently encodes two different things into a single flag.
It contains sticky gfp_mask for page cache allocations and AS_ codes used
to report errors/enospace and other states which are mapping specific.
Condensing the two semantically unrelated things saves few bytes but it
also complicates other things. For one thing the gfp flags space is
reduced and in fact we are already running out of available bits. It can
be assumed that more gfp flags will be necessary later on.

To not introduce the address_space grow (at least on x86_64) we can stick
it right after private_lock because we have a hole there.

struct address_space {
struct inode * host; /* 0 8 */
struct radix_tree_root page_tree; /* 8 16 */
spinlock_t tree_lock; /* 24 4 */
atomic_t i_mmap_writable; /* 28 4 */
struct rb_root i_mmap; /* 32 8 */
struct rw_semaphore i_mmap_rwsem; /* 40 40 */
/* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
long unsigned int nrpages; /* 80 8 */
long unsigned int nrexceptional; /* 88 8 */
long unsigned int writeback_index; /* 96 8 */
const struct address_space_operations * a_ops; /* 104 8 */
long unsigned int flags; /* 112 8 */
spinlock_t private_lock; /* 120 4 */

/* XXX 4 bytes hole, try to pack */

/* --- cacheline 2 boundary (128 bytes) --- */
struct list_head private_list; /* 128 16 */
void * private_data; /* 144 8 */

/* size: 152, cachelines: 3, members: 14 */
/* sum members: 148, holes: 1, sum holes: 4 */
/* last cacheline: 24 bytes */

Link: http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# d0a5b995 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Add IOP_XATTR inode operations flag

The IOP_XATTR inode operations flag in inode->i_opflags indicates that
the inode has xattr support. The flag is automatically set by
new_inode() on filesystems with xattr support (where sb->s_xattr is
defined), and cleared otherwise. Filesystems can explicitly clear it
for inodes that should not have xattr support.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# bba0bd31 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

sockfs: Get rid of getxattr iop

If we allow pseudo-filesystems created with mount_pseudo to have xattr
handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler
to use the xattr handler name parsing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 82c156f8 22-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

switch generic_file_splice_read() to use of ->read_iter()

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 3cd88666 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: Add current_time() api

current_fs_time() is used for inode timestamps.

Change the signature of the function to take inode pointer
instead of superblock as per Linus's suggestion.

Also, move the api under vfs as per the discussion on the
thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
suggestion on the thread, changing the function name.

current_fs_time() will be deleted after all the references
to it are replaced by current_time().

There was a bug reported by kbuild test bot with the change
as some of the calls to current_time() were made before the
super_block was initialized. Catch these accidental assignments
as timespec_trunc() does for wrong granularities. This allows
for the function to work right even in these circumstances.
But, adds a warning to make the user aware of the bug.

A coccinelle script was used to identify all the current
.alloc_inode super_block callbacks that updated inode timestamps.
proc filesystem was the only one that was modifying inode times
as part of this callback. The series includes a patch to fix that.

Note that timespec_trunc() will also be moved to fs/inode.c
in a separate patch when this will need to be revamped for
bounds checking purposes.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9b80a184 01-Sep-2016 Alexey Dobriyan <adobriyan@gmail.com>

fs/file: more unsigned file descriptors

Propagate unsignedness for grand total of 149 bytes:

$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
function old new delta
set_close_on_exec 99 98 -1
put_files_struct 201 200 -1
get_close_on_exec 59 58 -1
do_prlimit 498 497 -1
do_execveat_common.isra 1662 1661 -1
__close_fd 178 173 -5
do_dup2 219 204 -15
seq_show 685 660 -25
__alloc_fd 384 357 -27
dup_fd 718 646 -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 2773bf00 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

fs: rename "rename2" i_op to "rename"

Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 18fc84da 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove unused i_op->rename

No in-tree uses remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# e0e0be8a 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

libfs: support RENAME_NOREPLACE in simple_rename()

This is trivial to do:

- add flags argument to simple_rename()
- check if flags doesn't have any other than RENAME_NOREPLACE
- assign simple_rename() to .rename2 instead of .rename

Filesystems converted:

hugetlbfs, ramfs, bpf.

Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access. For this case pass zero flags to simple_rename().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>

# 31051c85 26-May-2016 Jan Kara <jack@suse.cz>

fs: Give dentry to inode_change_ok() instead of inode

inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

# c568d683 15-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

locks: fix file locking on overlayfs

This patch allows flock, posix locks, ofd locks and leases to work
correctly on overlayfs.

Instead of using the underlying inode for storing lock context use the
overlay inode. This allows locks to be persistent across copy-up.

This is done by introducing locks_inode() helper and using it instead of
file_inode() to get the inode in locking code. For non-overlayfs the two
are equivalent, except for an extra pointer dereference in locks_inode().

Since lock operations are in "struct file_operations" we must also make
sure not to call underlying filesystem's lock operations. Introcude a
super block flag MS_NOREMOTELOCK to this effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>

# 598e3c8f 15-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: update ovl inode before relatime check

On overlayfs relatime_need_update() needs inode times to be correct on
overlay inode. But i_mtime and i_ctime are updated by filesystem code on
underlying inode only, so they will be out-of-date on the overlay inode.

This patch copies the times from the underlying inode if needed. This
can't be done if called from RCU lookup (link following) but link m/ctime
are not updated by fs, so this is all right.

This patch doesn't change functionality for anything but overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# f2b20f6e 15-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: move permission checking into notify_change() for utimes(NULL)

This fixes a bug where the permission was not properly checked in
overlayfs. The testcase is ltp/utimensat01.

It is also cleaner and safer to do the permission checking in the vfs
helper instead of the caller.

This patch introduces an additional ia_valid flag ATTR_TOUCH (since
touch(1) is the most obvious user of utimes(NULL)) that is passed into
notify_change whenever the conditions for this special permission checking
mode are met.

Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Aihua Zhang <zhangaihua1@huawei.com>
Cc: <stable@vger.kernel.org> # v3.18+

# 2a3a2a3f 01-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't cache acl on overlay layer

Some operations (setxattr/chmod) can make the cached acl stale. We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether. Preventing caching has the following

- no double caching, less memory used

- overlay cache doesn't go stale when fs clears it's own cache

Possible disadvantage is performance loss. If that becomes a problem
get_acl() can be optimized for overlayfs.

This patch disables caching by pre setting i_*acl to a value that

- has bit 0 set, so is_uncached_acl() will return true

- is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it

The constant -3 was chosen for this purpose.

Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# c11f0c0b 05-Aug-2016 Jens Axboe <axboe@fb.com>

block/mm: make bdev_ops->rw_page() take a bool for read/write

Commit abf545484d31 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# abf54548 04-Aug-2016 Mike Christie <mchristi@redhat.com>

mm/block: convert rw_page users to bio op use

The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>

# f0fce87c 03-Aug-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: make dentry_needs_remove_privs() internal

Only used by the vfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# 3e5a4c94 03-Aug-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove file_needs_remove_privs()

This function is now unused.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# a098ecd2 02-Aug-2016 Stephen Boyd <stephen.boyd@linaro.org>

firmware: support loading into a pre-allocated buffer

Some systems are memory constrained but they need to load very large
firmwares. The firmware subsystem allows drivers to request this
firmware be loaded from the filesystem, but this requires that the
entire firmware be loaded into kernel memory first before it's provided
to the driver. This can lead to a situation where we map the firmware
twice, once to load the firmware into kernel memory and once to copy the
firmware into the final resting place.

This creates needless memory pressure and delays loading because we have
to copy from kernel memory to somewhere else. Let's add a
request_firmware_into_buf() API that allows drivers to request firmware
be loaded directly into a pre-allocated buffer. This skips the
intermediate step of allocating a buffer in kernel memory to hold the
firmware image while it's read from the filesystem. It also requires
that drivers know how much memory they'll require before requesting the
firmware and negates any benefits of firmware caching because the
firmware layer doesn't manage the buffer lifetime.

For a 16MB buffer, about half the time is spent performing a memcpy from
the buffer to the final resting place. I see loading times go from
0.081171 seconds to 0.047696 seconds after applying this patch. Plus
the vmalloc pressure is reduced.

This is based on a patch from Vikram Mulukutla on codeaurora.org:

Link: http://lkml.kernel.org/r/20160607164741.31849-4-stephen.boyd@linaro.org
Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# d72d9e2a 29-Jul-2016 Miklos Szeredi <mszeredi@redhat.com>

mm: export filemap_check_errors() to modules

Can be used by fuse, btrfs and f2fs to replace opencoded variants.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# bda807d4 26-Jul-2016 Minchan Kim <minchan@kernel.org>

mm: migrate: support non-lru movable page migration

We have allowed migration for only LRU pages until now and it was enough
to make high-order pages. But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,. enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable. For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully. On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation. If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page. The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0. If driver cannot migrate the page at the
moment, driver can return -EAGAIN. On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure". On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page. In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under

void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM. Exactly speaking, PG_movable is
not a real flag of struct page. Rather than, VM reuses page->mapping's
lower bits to represent it.

page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly. Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page. As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable). But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated. Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page. So if a CPU encounters PG_isolated
non-lru movable page, it can skip it. Driver doesn't need to manipulate
the flag because VM will set/clear it automatically. Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field. PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 6c60d2b5 26-Jul-2016 Dave Chinner <dchinner@redhat.com>

fs/fs-writeback.c: add a new writeback list for sync

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback. wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback. Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 70246286 19-Jul-2016 Christoph Hellwig <hch@lst.de>

block: get rid of bio_rw and READA

These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# a4a4f943 19-Jul-2016 Al Viro <viro@zeniv.linux.org.uk>

bdev: get rid of ->bd_inodes

Since 2006 we have ->i_bdev pinning bdev in question, so there's no
way to get to bdev ->evict_inode() while there's an aliasing inode
anywhere. In other words, the only place walking the list of aliases
is guaranteed to do it only when the list is empty...

Remove the detritus; it should've been done in "[PATCH] Fix a race
condition between ->i_mapping and iput()", but nobody had noticed it
back then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 81754357 11-Dec-2014 Seth Forshee <seth.forshee@canonical.com>

fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns

For filesystems mounted from a user namespace on-disk ids should
be translated relative to s_users_ns rather than init_user_ns.

When an id in the filesystem doesn't exist in s_user_ns the
associated id in the inode will be set to INVALID_[UG]ID, which
turns these into de facto "nobody" ids. This actually maps pretty
well into the way most code already works, and those places where
it didn't were fixed in previous patches. Moving forward vfs code
needs to be careful to handle instances where ids in inodes may
be invalid.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

# 0bd23d09 29-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Don't modify inodes with a uid or gid unknown to the vfs

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk. In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set. Reads of the inodes appear safe and
useful, but any write or modification is disallowed. The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values. After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr. As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed. evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 2d902671 30-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: merge .d_select_inode() into .d_real()

The two methods essentially do the same: find the real dentry/inode
belonging to an overlay dentry. The difference is in the usage:

vfs_open() uses ->d_select_inode() and expects the function to perform
copy-up if necessary based on the open flags argument.

file_dentry() uses ->d_real() passing in the overlay dentry as well as the
underlying inode.

vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
with a zero inode would have worked just as well here.

This patch merges the functionality of ->d_select_inode() into ->d_real()
by adding an 'open_flags' argument to the latter.

[Al Viro] Make the signature of d_real() match that of ->d_real() again.
And constify the inode argument, while we are at it.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

# cc50a07a 09-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag

Now that SB_I_NODEV controls the nodev behavior devpts can just clear
this flag during mount. Simplifying the code and making it easier
to audit how the code works. While still preserving the invariant
that s_iflags is only modified during mount.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# a2982cc9 09-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Generalize filesystem nodev handling.

Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 6e4eab57 24-May-2016 Eric W. Biederman <ebiederm@xmission.com>

fs: Add user namespace member to struct super_block

Start marking filesystems with a user namespace owner, s_user_ns. In
this change this is only used for permission checks of who may mount a
filesystem. Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.

The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that

The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a

The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be

The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information. The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.

For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.

For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.

Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.

Getting the details of setting s_user_ns correct has been
a long process. Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns. Thanks to Seth Forshee who has kept the work alive.

Thanks-to: Seth Forshee <seth.forshee@canonical.com>
Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

# d91ee87d 23-May-2016 Eric W. Biederman <ebiederm@xmission.com>

vfs: Pass data, ns, and ns->userns to mount_ns

Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 8654df4e 09-Jun-2016 Eric W. Biederman <ebiederm@xmission.com>

mnt: Refactor fs_fully_visible into mount_too_revealing

Replace the call of fs_fully_visible in do_new_mount from before the
new superblock is allocated with a call of mount_too_revealing after
the superblock is allocated. This winds up being a much better location
for maintainability of the code.

The first change this enables is the replacement of FS_USERNS_VISIBLE
with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type
to sb_iflags on the superblock.

Unfortunately mount_too_revealing fundamentally needs to touch
mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
times. If the mnt_flags did not need to be touched the code
could be easily moved into the filesystem specific mount code.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 28a8f0d3 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 4e1b2d52 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, fs, drivers: remove REQ_OP compat defs and related code

This patch drops the compat definition of req_op where it matches
the rq_flag_bits definitions, and drops the related old and compat
code that allowed users to set either the op or flags for the operation.

We also then store the operation in the bi_rw/cmd_flags field similar
to how we used to store the bio ioprio where it sat in the upper bits
of the field.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 8a4c1e42 05-Jun-2016 Mike Christie <mchristi@redhat.com>

direct-io: use bio set/get op accessors

This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op

It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# f2150821 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block: add REQ_OP definitions and helpers

The following patches separate the operation (WRITE, READ, DISCARD,
etc) from the rq_flag_bits flags. This patch adds definitions for
request/bio operations (REQ_OPs) and adds request/bio accessors to
get/set the op.

In this patch the REQ_OPs match the REQ rq_flag_bits ones
for compat reasons while all the code is converted to use the
op accessors in the set. In the last patches the op will become a
number and the accessors and helpers in this patch will be dropped
or updated.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 4e49ea4a 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block/fs/drivers: remove rw argument from submit_bio

This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>

# 3767e255 27-May-2016 Al Viro <viro@zeniv.linux.org.uk>

switch ->setxattr() to passing dentry and inode separately

smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.

Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.

Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# acc93d30 07-May-2016 Dan Williams <dan.j.williams@intel.com>

Revert "block: enable dax for raw block devices"

This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937.

The functionality is superseded by the new "Device DAX" facility.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# a8078b1f 10-May-2016 Toshi Kani <toshi.kani@hpe.com>

block: Update blkdev_dax_capable() for consistency

blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
to remain as a separate interface for checking dax capability of
a raw block device.

Rename and relocate blkdev_dax_capable() to keep them maintained
consistently, and call bdev_direct_access() for the dax capability

There is no change in the behavior.

Link: https://lkml.org/lkml/2016/5/9/950
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

# 02fbd139 11-May-2016 Jan Kara <jack@suse.cz>

dax: Remove complete_unwritten argument

Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

# ae05327a 12-May-2016 Al Viro <viro@zeniv.linux.org.uk>

ext4: switch to ->iterate_shared()

Note that we need relax_dir() equivalent for directories
locked shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 61922694 20-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

introduce a parallel variant of ->iterate()

New method: ->iterate_shared(). Same arguments as in ->iterate(),
called with the directory locked only shared. Once all filesystems
switch, the old one will be gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9902af79 15-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

parallel lookups: actual switch to rwsem


The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.

lockdep side also might need more work

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 84e710da 14-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

parallel lookups machinery, part 2

We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not. And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
* we verify that there's no hashed match
* existing in-lookup match gets hashed by another process
* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup. Then the above would turn into
* read the counter
* do dcache lookup
* if no matches found, check for in-lookup matches
* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and updating logics; the readers will be
added in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# e2592217 07-Apr-2016 Christoph Hellwig <hch@lst.de>

fs: simplify the generic_write_sync prototype

The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# dde0c2e7 07-Apr-2016 Christoph Hellwig <hch@lst.de>


This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# c8b8e32d 07-Apr-2016 Christoph Hellwig <hch@lst.de>

direct-io: eliminate the offset argument to ->direct_IO

Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 1af5bb49 07-Apr-2016 Christoph Hellwig <hch@lst.de>

filemap: remove the pos argument to generic_file_direct_write

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# da20dfe6 21-Apr-2016 Kees Cook <keescook@chromium.org>

fs: fix over-zealous use of "const"

When I was fixing up const recommendations from checkpatch.pl, I went
overboard. This fixes the warning (during a W=1 build):

include/linux/fs.h:2627:74: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
static inline const char * const kernel_read_file_id_str(enum kernel_read_file_id id)

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>

# 1284ab5b 20-Apr-2016 Mimi Zohar <zohar@linux.vnet.ibm.com>

fs: define a string representation of the kernel_read_file_id enumeration

A string representation of the kernel_read_file_id enumeration is
needed for displaying messages (eg. pr_info, auditing) that can be
used by multiple LSMs and the integrity subsystem. To simplify
keeping the list of strings up to date with the enumeration, this
patch defines two new preprocessing macros named __fid_enumify and
__fid_stringify to create the enumeration and an array of strings.
kernel_read_file_id_str() returns a string based on the enumeration.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
[kees: removed removal of my old version, constified pointer values]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>

# ce23e640 10-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

->getxattr(): pass dentry and inode as separate arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;


- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

expression E;
+ E

expression E;
+ E




expression E;

expression E;
- page_cache_get(E)
+ get_page(E)

expression E;
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# b8a7a3a6 24-Mar-2016 Andreas Gruenbacher <agruenba@redhat.com>

posix_acl: Inode acl caching fixes

When get_acl() is called for an inode whose ACL is not cached yet, the
get_acl inode operation is called to fetch the ACL from the filesystem.
The inode operation is responsible for updating the cached acl with
set_cached_acl(). This is done without locking at the VFS level, so
another task can call set_cached_acl() or forget_cached_acl() before the
get_acl inode operation gets to calling set_cached_acl(), and then
get_acl's call to set_cached_acl() results in caching an outdate ACL.

Prevent this from happening by setting the cached ACL pointer to a
task-specific sentinel value before calling the get_acl inode operation.
Move the responsibility for updating the cached ACL from the get_acl
inode operations to get_acl(). There, only set the cached ACL if the
sentinel value hasn't changed.

The sentinel values are chosen to have odd values. Likewise, the value
of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have
an even value (ACLs are aligned in memory). This allows to distinguish
uncached ACLs values from ACL objects.

In addition, switch from guarding inode->i_acl and inode->i_default_acl
upates by the inode->i_lock spinlock to using xchg() and cmpxchg().

Filesystems that do not want ACLs returned from their get_acl inode
operations to be cached must call forget_cached_acl() to prevent the VFS
from doing so.

(Patch written by Al Viro and Andreas Gruenbacher.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 49db08c3 19-Feb-2016 Linus Walleij <linus.walleij@linaro.org>

chrdev: emit a warning when we go below dynamic major range

Currently a dynamically allocated character device major is taken
from 254 and downward. This mechanism is used for RTC, IIO and a
few other subsystems.

The kernel currently has no check prevening these dynamic
allocations from eating into the assigned numbers at 233 and

In a recent test it was reported that so many dynamic device
majors were used on a test server, that the major number for
infiniband (231) was stolen. This occurred when allocating a new
major number for GPIO chips. The error messages from the kernel
were not helpful. (See: https://lkml.org/lkml/2016/2/14/124)

This patch adds a defined lower limit of the dynamic major
allocation region will henceforth emit a warning if we start to
eat into the assigned numbers. It does not do any semantic
changes and will not change the kernels behaviour: numbers will
still continue to be stolen, but we will know from dmesg what
is going on.

This also updates the Documentation/devices.txt to clearly
reflect that we are using this range of major numbers for dynamic

Reported-by: Ying Huang <ying.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

# 7df818b2 25-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

constify vfs_truncate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# d101a125 26-Mar-2016 Miklos Szeredi <miklos@szeredi.hu>

fs: add file_dentry()

This series fixes bugs in nfs and ext4 due to 4bacc9c9234c ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

Regular files opened on overlayfs will result in the file being opened on
the underlying filesystem, while f_path points to the overlayfs

This confuses filesystems which get the dentry from struct file and assume
it's theirs.

Add a new helper, file_dentry() [*], to get the filesystem's own dentry
from the file. This checks file->f_path.dentry->d_flags against
DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
set (this is the common, non-overlayfs case).

In the uncommon case it will call into overlayfs's ->d_real() to get the
underlying dentry, matching file_inode(file).

The reason we need to check against the inode is that if the file is copied
up while being open, d_real() would return the upper dentry, while the open
file comes from the lower dentry.

[*] If possible, it's better simply to use file_inode() instead.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: <stable@vger.kernel.org> # v4.2
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Axtens <dja@axtens.net>

# 378c6520 22-Mar-2016 Jann Horn <jann@thejh.net>

fs/coredump: prevent fsuid=0 dumps into user-controlled directories

This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:

- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)

Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.

To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 0b81d077 15-May-2015 Jaegeuk Kim <jaegeuk@kernel.org>

fs crypto: move per-file encryption from f2fs tree to fs/crypto

This patch adds the renamed functions moved from the f2fs crypto files.

1. definitions for per-file encryption used by ext4 and f2fs.

2. crypto.c for encrypt/decrypt functions
a. IO preparation:
- fscrypt_get_ctx / fscrypt_release_ctx
b. before IOs:
- fscrypt_encrypt_page
- fscrypt_decrypt_page
- fscrypt_zeroout_range
c. after IOs:
- fscrypt_decrypt_bio_pages
- fscrypt_pullback_bio_page
- fscrypt_restore_control_page

3. policy.c supporting context management.
a. For ioctls:
- fscrypt_process_policy
- fscrypt_get_policy
b. For context permission
- fscrypt_has_permitted_context
- fscrypt_inherit_context

4. keyinfo.c to handle permissions
- fscrypt_get_encryption_info
- fscrypt_free_encryption_info

5. fname.c to support filename encryption
a. general wrapper functions
- fscrypt_fname_disk_to_usr
- fscrypt_fname_usr_to_disk
- fscrypt_setup_filename
- fscrypt_free_filename

b. specific filename handling functions
- fscrypt_fname_alloc_buffer
- fscrypt_fname_free_buffer

6. Makefile and Kconfig

Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

# 9d95afd5 01-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

kill dentry_unhash()

the last user is gone

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 97be7ebe 03-Mar-2016 Christoph Hellwig <hch@lst.de>

vfs: add the RWF_HIPRI flag for preadv2/pwritev2

This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware. The flag is purely
advisory and can be ignored if not supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 793b80ef 03-Mar-2016 Christoph Hellwig <hch@lst.de>

vfs: pass a flags argument to vfs_readv/vfs_writev

This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations. For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 7429b092 11-Apr-2014 Dmitry Kasatkin <dmitry.kasatkin@huawei.com>

ima: load policy using path

We currently cannot do appraisal or signature vetting of IMA policies
since we currently can only load IMA policies by writing the contents
of the policy directly in, as follows:

cat policy-file > <securityfs>/ima/policy

If we provide the kernel the path to the IMA policy so it can load
the policy itself it'd be able to later appraise or vet the file
signature if it has one. This patch adds support to load the IMA
policy with a given path as follows:

echo /etc/ima/ima_policy > /sys/kernel/security/ima/policy

Changelog v4+:
- moved kernel_read_file_from_path() error messages to callers
- moved kernel_read_file_from_path() to a separate patch
- after re-ordering the patches, replace calling integrity_kernel_read()
to read the file with kernel_read_file_from_path() (Mimi)
- Patch description re-written by Luis R. Rodriguez

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>

# b804defe 14-Jan-2016 Mimi Zohar <zohar@linux.vnet.ibm.com>

kexec: replace call to copy_file_from_fd() with kernel version

Replace copy_file_from_fd() with kernel_read_file_from_fd().

are defined for measuring, appraising or auditing the kexec image and

Changelog v3:
- return -EBADF, not -ENOEXEC
- identifier change
- split patch, moving copy_file_from_fd() to a separate patch
- split patch, moving IMA changes to a separate patch
- use kstat file size type loff_t, not size_t
- Calculate the file hash from the in memory buffer - Dave Young

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Dave Young <dyoung@redhat.com>

# a1db7420 30-Dec-2015 Mimi Zohar <zohar@linux.vnet.ibm.com>

module: replace copy_module_from_fd with kernel version

Replace copy_module_from_fd() with kernel_read_file_from_fd().

Although none of the upstreamed LSMs define a kernel_module_from_file
hook, IMA is called, based on policy, to prevent unsigned kernel modules
from being loaded by the original kernel module syscall and to
measure/appraise signed kernel modules.

The security function security_kernel_module_from_file() was called prior
to reading a kernel module. Preventing unsigned kernel modules from being
loaded by the original kernel module syscall remains on the pre-read
kernel_read_file() security hook. Instead of reading the kernel module
twice, once for measuring/appraising and again for loading the kernel
module, the signature validation is moved to the kernel_post_read_file()
security hook.

This patch removes the security_kernel_module_from_file() hook and security

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>

# b844f0ec 01-Feb-2016 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: define kernel_copy_file_from_fd()

This patch defines kernel_read_file_from_fd(), a wrapper for the VFS
common kernel_read_file().

- Separated from the kernel modules patch
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>

# e40ba6d5 18-Nov-2015 Mimi Zohar <zohar@linux.vnet.ibm.com>

firmware: replace call to fw_read_file_contents() with kernel version

Replace the fw_read_file_contents with kernel_file_read_from_path().

Although none of the upstreamed LSMs define a kernel_fw_from_file hook,
IMA is called by the security function to prevent unsigned firmware from
being loaded and to measure/appraise signed firmware, based on policy.

Instead of reading the firmware twice, once for measuring/appraising the
firmware and again for reading the firmware contents into memory, the
kernel_post_read_file() security hook calculates the file hash based on
the in memory file buffer. The firmware is read once.

This patch removes the LSM kernel_fw_from_file() hook and security call.

Changelog v4+:
- revert dropped buf->size assignment - reported by Sergey Senozhatsky
- remove kernel_fw_from_file hook
- use kernel_file_read_from_path() - requested by Luis
- reordered and squashed firmware patches
- fix MAX firmware size (Kees Cook)

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>

# 09596b94 18-Nov-2015 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: define kernel_read_file_from_path

This patch defines kernel_read_file_from_path(), a wrapper for the VFS
common kernel_read_file().

- revert error msg regression - reported by Sergey Senozhatsky
- Separated from the IMA patch

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

# bc8ca5b9 24-Jan-2016 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: define kernel_read_file_id enumeration

To differentiate between the kernel_read_file() callers, this patch
defines a new enumeration named kernel_read_file_id and includes the
caller identifier as an argument.


Changelog v3:
- Replace the IMA specific enumeration with a generic one.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

# b44a7dfc 28-Dec-2015 Mimi Zohar <zohar@linux.vnet.ibm.com>

vfs: define a generic function to read a file from the kernel

For a while it was looked down upon to directly read files from Linux.
These days there exists a few mechanisms in the kernel that do just
this though to load a file into a local buffer. There are minor but
important checks differences on each. This patch set is the first
attempt at resolving some of these differences.

This patch introduces a common function for reading files from the kernel
with the corresponding security post-read hook and function.

Changelog v4+:
- export security_kernel_post_read_file() - Fengguang Wu
- additional bounds checking - Luis
- To simplify patch review, re-ordered patches

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

# 187372a3 07-Feb-2016 Christoph Hellwig <hch@lst.de>

direct-io: always call ->end_io if non-NULL

This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 9f4736fe 28-Jan-2016 Dan Williams <dan.j.williams@intel.com>

block: revert runtime dax control of the raw block device

Dynamically enabling DAX requires that the page cache first be flushed
and invalidated. This must occur atomically with the change of DAX mode
otherwise we confuse the fsync/msync tracking and violate data
durability guarantees. Eliminate the possibilty of DAX-disabled to
DAX-enabled transitions for now and revisit this for the next cycle.

Cc: Jan Kara <jack@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# 65f87ee7 25-Jan-2016 Dan Williams <dan.j.williams@intel.com>

fs, block: force direct-I/O for dax-enabled block devices

Similar to the file I/O path, re-direct all I/O to the DAX path for I/O
to a block-device special file. Both regular files and device special
files can use the common filp->f_mapping->host lookup to determing is
DAX is enabled.

Otherwise, we confuse the DAX code that does not expect to find live
data in the page cache:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 7676 at mm/filemap.c:217
Modules linked in:
CPU: 0 PID: 7676 Comm: a.out Not tainted 4.4.0+ #276
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
00000000ffffffff ffff88006d3f7738 ffffffff82999e2d 0000000000000000
ffff8800620a0000 ffffffff86473d20 ffff88006d3f7778 ffffffff81352089
ffffffff81658d36 ffffffff86473d20 00000000000000d9 ffffea0000009d60
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
[<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
[<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
[<ffffffff81658d36>] __delete_from_page_cache+0x9f6/0xb60 mm/filemap.c:217
[<ffffffff81658fb2>] delete_from_page_cache+0x112/0x200 mm/filemap.c:244
[<ffffffff818af369>] __dax_fault+0x859/0x1800 fs/dax.c:487
[<ffffffff8186f4f6>] blkdev_dax_fault+0x26/0x30 fs/block_dev.c:1730
[< inline >] wp_pfn_shared mm/memory.c:2208
[<ffffffff816e9145>] do_wp_page+0xc85/0x14f0 mm/memory.c:2307
[< inline >] handle_pte_fault mm/memory.c:3323
[< inline >] __handle_mm_fault mm/memory.c:3417
[<ffffffff816ecec3>] handle_mm_fault+0x2483/0x4640 mm/memory.c:3446
[<ffffffff8127eff6>] __do_page_fault+0x376/0x960 arch/x86/mm/fault.c:1238
[<ffffffff8127f738>] trace_do_page_fault+0xe8/0x420 arch/x86/mm/fault.c:1331
[<ffffffff812705c4>] do_async_page_fault+0x14/0xd0 arch/x86/kernel/kvm.c:264
[<ffffffff86338f78>] async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:986
[<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a
---[ end trace dae21e0f85f1f98c ]---

Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Matthew Wilcox <willy@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# f9fe48be 22-Jan-2016 Ross Zwisler <zwisler@kernel.org>

dax: support dirty DAX entries in radix tree

Add support for tracking dirty DAX entries in the struct address_space
radix tree. This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages. These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third. We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage. This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.

The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes. These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 5a023cdb 30-Nov-2015 Dan Williams <dan.j.williams@intel.com>

block: enable dax for raw block devices

If an application wants exclusive access to all of the persistent memory
provided by an NVDIMM namespace it can use this raw-block-dax facility
to forgo establishing a filesystem. This capability is targeted
primarily to hypervisors wanting to provision persistent memory for
guests. It can be disabled / enabled dynamically via the new BLKDAXSET

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

# 66cf191f 07-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

compat_ioctl: don't pass fd around when not needed

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# f27a0fe0 07-Jan-2016 Jeff Layton <jeff.layton@primarydata.com>

locks: pass inode pointer to locks_free_lock_context

...so we can print information about it if there are leaked locks.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>

# 54dbc151 19-Dec-2015 Darrick J. Wong <darrick.wong@oracle.com>

vfs: hoist the btrfs deduplication ioctl to the vfs

Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# fceef393 29-Dec-2015 Al Viro <viro@zeniv.linux.org.uk>

switch ->get_link() to delayed_call, kill ->put_link()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# cd3417c8 29-Dec-2015 Al Viro <viro@zeniv.linux.org.uk>

kill free_page_put_link()

all callers are better off with kfree_put_link()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# b25472f9 05-Dec-2015 Al Viro <viro@zeniv.linux.org.uk>

new helpers: no_seek_end_llseek{,_size}()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 6b255391 17-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

replace ->follow_link() with new method that could stay in RCU mode

new method: ->get_link(); replacement of ->follow_link(). The differences
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 21fc61c7 16-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

don't put symlink bodies in pagecache into highmem

kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 04b38d60 02-Dec-2015 Christoph Hellwig <hch@lst.de>

vfs: pull btrfs clone API to vfs layer

The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development. To avoid growth of various slightly incompatible
implementations, add one to the VFS. Note that clones are different from
file copies in several ways:

- they are atomic vs other writers
- they support whole file clones
- they support 64-bit legth clones
- they do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure. The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# acc15575 02-Dec-2015 Christoph Hellwig <hch@lst.de>

locks: new locks_mandatory_area calling convention

Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also simplify the read/write selection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# d37177ba 19-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

fs/attr.c: is_sxid can be boolean

This patch makes is_sxid return bool to improve readability
due to this particular function only using either one or zero
as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 0e3ef1fe 19-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

fs/bad_inode.c: is_bad_inode can be boolean

This patch makes is_bad_inode return bool to improve
readability due to this particular function only using either
one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# a6e5787f 16-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

fs/dcache.c: is_subdir can be boolean

This patch makes is_subdir return bool to improve
readability due to this particular function only using either
one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 25ab4c9b 16-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

fs/namespace.c: path_is_under can be boolean

This patch makes path_is_under return bool to improve
readability due to this particular function only using either
one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 3e004eea 16-Nov-2015 Yaowei Bai <baiyaowei@cmss.chinamobile.com>

fs/block_dev.c: make sb_is_blkdev_sb return bool when CONFIG_BLOCK undefined

Currently when CONFIG_BLOCK is defined sb_is_blkdev_sb returns bool,
while when CONFIG_BLOCK is not defined it returns int. Let's keep
consistent to make sb_is_blkdev_sb return bool as well when CONFIG_BLOCK
isn't defined.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 29732938 10-Nov-2015 Zach Brown <zab@redhat.com>

vfs: add copy_file_range syscall and vfs helper

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
Change flags parameter from int to unsigned int,
Add function to include/linux/syscalls.h,
Check copy len after file open mode,
Don't forbid ranges inside the same file,
Use rw_verify_area() to veriy ranges,
Use file_out rather than file_in,
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9e8925b6 16-Nov-2015 Jeff Layton <jeff.layton@primarydata.com>

locks: Allow disabling mandatory locking at compile time

Mandatory locking appears to be almost unused and buggy and there
appears no real interest in doing anything with it. Since effectively
no one uses the code and since the code is buggy let's allow it to be
disabled at compile time. I would just suggest removing the code but
undoubtedly that will break some piece of userspace code somewhere.

For the distributions that don't care about this piece of code
this gives a nice starting point to make mandatory locking go away.

Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jeff Layton <jeff.layton@primarydata.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>

# c8fffa64 08-Oct-2015 Ross Zwisler <zwisler@kernel.org>

vfs: remove stale comment in inode_operations

The big warning comment that is currently at the end of struct
inode_operations was added as part of this commit:

4aa7c6346be3 ("vfs: add i_op->dentry_open()")

It was added to warn people not to use the newly added 'dentry_open'
function pointer.

This function pointer was removed as part of this commit:

4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and
f_inode to the underlay")

The comment was left behind and now refers to nothing, so remove it.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# dece1635 05-Nov-2015 Jens Axboe <axboe@fb.com>

block: change ->make_request_fn() and users to return a queue cookie

No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>

# aa750fd7 05-Nov-2015 Junichi Nomura <j-nomura@ce.jp.nec.com>

mm/filemap.c: make global sync not clear error status of individual inodes

filemap_fdatawait() is a function to wait for on-going writeback to
complete but also consume and clear error status of the mapping set during

The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).

However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.

As a result, fsync() may not be able to detect writeback error if events
happen in the following order:

Application System admin
write data on page cache
Run sync command
writeback completes with error
filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)

This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 616fb38f 22-Oct-2015 Benjamin Coddington <bcodding@redhat.com>

locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait

All callers use locks_lock_inode_wait() instead.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>

# e55c34a6 22-Oct-2015 Benjamin Coddington <bcodding@redhat.com>

locks: introduce locks_lock_inode_wait()

Users of the locks API commonly call either posix_lock_file_wait() or
flock_lock_file_wait() depending upon the lock type. Add a new function
locks_lock_inode_wait() which will check and call the correct function for
the type of lock passed in.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>

# c94c2acf 08-Sep-2015 Matthew Wilcox <willy@infradead.org>

dax: move DAX-related functions to a new header

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# 5477e70a 04-Sep-2015 Oleg Nesterov <oleg@redhat.com>

mm: move ->mremap() from file_operations to vm_operations_struct

vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
way ->mremap() can have more users. Say, vdso.

While at it, s/aio_ring_remap/aio_ring_mremap/.

Note: this is the minimal change before ->mremap() finds another user in
file_operations; this method should have more arguments, and it can be
used to kill arch_remap().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# c7f54084 04-Mar-2015 Dave Chinner <dchinner@redhat.com>

inode: rename i_wb_list to i_io_list

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>

# e97fedb9 04-Mar-2015 Dave Chinner <dchinner@redhat.com>

sync: serialise per-superblock sync operations

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>

# 74278da9 03-Mar-2015 Dave Chinner <dchinner@redhat.com>

inode: convert inode_sb_list_lock to per-sb

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>

# cbedaac6 12-Mar-2015 Josef Bacik <jbacik@fb.com>

inode: add hlist_fake to avoid the inode hash lock in evict

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock() needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this add hlist_fake to test if the inode isn't actually
hashed to avoid taking the hash lock on inodes that have never been
hashed. Based on Dave Chinner's

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

basd on Al's suggestions. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>

# 8129ed29 11-Aug-2015 Oleg Nesterov <oleg@redhat.com>

change sb_writers to use percpu_rw_semaphore

We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.

This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.

This change tries to address the following problems:

- Firstly, __sb_start_write() looks simply buggy. It does
__sb_end_write() if it sees ->frozen, but if it migrates
to another CPU before percpu_counter_dec(), sb_wait_write()
can wrongly succeed if there is another task which holds
the same "semaphore": sb_wait_write() can miss the result
of the previous percpu_counter_inc() but see the result
of this percpu_counter_dec().

- As Dave Hansen reports, it is suboptimal. The trivial
microbenchmark that writes to a tmpfs file in a loop runs
12% faster if we change this code to rely on RCU and kill
the memory barriers.

- This code doesn't look simple. It would be better to rely
on the generic locking code.

According to Dave, this change adds the same performance

Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:

- This will be "fixed" by the rcu_sync changes we are going
to merge. After that freeze_super()->percpu_down_write()
will use synchronize_sched(), and thaw_super() won't use
synchronize() at all.

This doesn't need any changes in fs/super.c.

- Once we merge rcu_sync changes, we can also change super.c
so that all wb_write->rw_sem's will share the single ->rss
in struct sb_writes, then freeze_super() will need only one

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>

# 853b39a7 22-Jul-2015 Oleg Nesterov <oleg@redhat.com>

shift percpu_counter_destroy() into destroy_super_work()

Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.

We do not even need this change right now, percpu_free_rwsem()
is fine in atomic context. But we are going to change this, it
will be might_sleep() after we merge the rcu_sync() patches.

And even after that we do not really need destroy_super_work(),
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.

So this is just the temporary kludge which helps us to avoid the
conflicts with the changes which will be (hopefully) routed via
rcu tree.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>

# bee9182d 19-Jul-2015 Oleg Nesterov <oleg@redhat.com>

introduce __sb_writers_{acquired,release}() helpers

Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>

# c87fb4a3 05-Aug-2015 J. Bruce Fields <bfields@redhat.com>

lockd: NLM grace period shouldn't block NFSv4 opens

NLM locks don't conflict with NFSv4 share reservations, so we're not
going to learn anything new by watiting for them.

They do conflict with NFSv4 locks and with delegations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>

# 4248b0da 06-Aug-2015 Mel Gorman <mgorman@suse.de>

fs, file table: reinit files_stat.max_files after deferred memory initialisation

Dave Hansen reported the following;

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

# ee296d7c 11-Jul-2015 Jeff Layton <jeff.layton@primarydata.com>

locks: inline posix_lock_file_wait and flock_lock_file_wait

They just call file_inode and then the corresponding *_inode_file_wait
function. Just make them static inlines instead.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>

# 29d01b22 11-Jul-2015 Jeff Layton <jeff.layton@primarydata.com>

locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait

Allow callers to pass in an inode instead of a filp.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>

# 90f8572b 29-Jun-2015 Eric W. Biederman <ebiederm@xmission.com>

vfs: Commit to never having exectuables on proc and sysfs.

Today proc and sysfs do not contain any executable files. Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.

Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.

Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.

The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.

This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.

Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc. Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# fbabfd0f 09-May-2015 Eric W. Biederman <ebiederm@xmission.com>

fs: Add helper functions for permanently empty directories.

To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories. Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permantently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

# 45f147a1 21-May-2015 Jan Kara <jack@suse.cz>

fs: Call security_ops->inode_killpriv on truncate

Comment in include/linux/security.h says that ->inode_killpriv() should
be called when setuid bit is being removed and that similar security
labels (in fact this applies only to file capabilities) should be
removed at this time as well. However we don't call ->inode_killpriv()
when we remove suid bit on truncate.

We fix the problem by calling ->inode_need_killpriv() and subsequently
->inode_killpriv() on truncate the same way as we do it on file write.

After this patch there's only one user of should_remove_suid() - ocfs2 -
and indeed it's buggy because it doesn't call ->inode_killpriv() on
write. However fixing it is difficult because of special locking

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# dbfae0cd 21-May-2015 Jan Kara <jack@suse.cz>

fs: Provide function telling whether file_remove_privs() will do anything

Provide function telling whether file_remove_privs() will do anything.
Currently we only have should_remove_suid() and that does something
slightly different.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 5fa8e0a1 21-May-2015 Jan Kara <jack@suse.cz>

fs: Rename file_remove_suid() to file_remove_privs()

file_remove_suid() is a misnomer since it removes also file capabilities
stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells
something else than whether file_remove_suid() call is necessary which
leads to bugs.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 9bf39ab2 19-Jun-2015 Miklos Szeredi <mszeredi@suse.cz>

vfs: add file_path() helper

d_path(&file->f_path, ...);
file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 4bacc9c9 18-Jun-2015 David Howells <dhowells@redhat.com>

overlayfs: Make f_path always point to the overlay and f_inode to the underlay

Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).

Using my union testsuite to set things up, before the patch I see:

[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
Device: 23h/35d Inode: 13381 Links: 1
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
Device: 23h/35d Inode: 13381 Links: 1

After the patch:

[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
Device: 23h/35d Inode: 40346 Links: 1
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
Device: 23h/35d Inode: 40346 Links: 1

Note the change in where /proc/$$/fd/5 points to in the ls command. It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).

The inode accessed, however, is the lower layer. The union layer is on device
25h/37d and the upper layer on 24h/36d.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

# 46b15caa 16-Jun-2015 Tejun Heo <tj@kernel.org>

vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB

FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
per-super_block flag.

super_block->s_flags carries some internal flags in the high bits but
it's exposd to userland through uapi header and running out of space
anyway. This patch adds a new field super_block->s_iflags to carry
kernel-internal flags. It is currently only used by the new
SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
consistency with other super_block flags.

ext2_fill_super() is updated to set SB_I_CGROUPWB.

v2: Added super_block->s_iflags instead of stealing another high bit
from sb->s_flags as suggested by Christoph and Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

# ce5c5d55 03-Jun-2015 Dave Chinner <dchinner@redhat.com>

dax: expose __dax_fault for filesystems with locking constraints

Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside.

This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling. It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# e842f290 03-Jun-2015 Dave Chinner <dchinner@redhat.com>

dax: don't abuse get_block mapping for endio callbacks

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>

# 682aa8e1 28-May-2015 Tejun Heo <tj@kernel.org>

writeback: implement unlocked_inode_to_wb transaction and use it for stat updates

The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch build the
framework for the actual switching.

This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a give inode. Second, it's used as a
mechanism to synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some

This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.

In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against

This patch still doesn't implement the actual switching.

v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.

v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.


Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 2a814908 28-May-2015 Tejun Heo <tj@kernel.org>

writeback: implement foreign cgroup inode detection

As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode. While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it. To avoid
unnnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback

The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO. The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier. Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.

Please see wb_detach_inode() for more details.

This patch only implements detection. Following patches will
implement actual switching.

v2: wbc_account_io() now checks whether the wbc is associated with a
wb before dereferencing it. This can happen when pageout() is
writing pages directly without going through the usual writeback
path. As pageout() path is single-threaded, we don't want it to
be blocked behind a slow cgroup and ultimately want it to delegate
actual writing to the usual writeback path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 52ebea74 22-May-2015 Tejun Heo <tj@kernel.org>

writeback: make backing_dev_info host cgroup-specific bdi_writebacks

For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback). This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg. This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg. This means that there may be multiple
wb's of a bdi mapped to the same blkcg. As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state. This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
by its memcg id. Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v3: inode_attach_wb() in account_page_dirtied() moved inside
mapping_cap_account_dirty() block where it's known to be !NULL.
Also, an unnecessary NULL check before kfree() removed. Both
detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

# 89e9b9e0 22-May-2015 Tejun Heo <tj@kernel.org>


cgroup writeback requires support from both bdi and filesystem sides.
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

# a212b105 22-May-2015 Tejun Heo <tj@kernel.org>

bdi: make inode_to_bdi() inline

Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency. The function
gets called from hot paths and fairly trivial.

This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
function calls inline. blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from

While at it, make sb_is_blkdev_sb() return bool instead of int.

v2: Fixed typo in description as suggested by Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

# be324177 05-May-2015 Jarod Wilson <jarod@redhat.com>

block: export blkdev_reread_part() and __blkdev_reread_part()

This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().

For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.

CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>