History log of /linux-master/fs/overlayfs/readdir.c
Revision Date Author Comments
# 420332b9 19-Jan-2024 Amir Goldstein <amir73il@gmail.com>

ovl: mark xwhiteouts directory with overlay.opaque='x'

An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').

This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.

This also prevents unnecessary checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.

Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.

Fixes: bc8df7a3dc03 ("ovl: Add an alternative type of whiteout")
Cc: <stable@vger.kernel.org> # v6.7
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Tested-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 02d70090 19-Nov-2023 Amir Goldstein <amir73il@gmail.com>

ovl: remove redundant ofs->indexdir member

When the index feature is disabled, ofs->indexdir is NULL.
When the index feature is enabled, ofs->indexdir has the same value as
ofs->workdir and takes an extra reference.

This makes the code harder to understand when it is not always clear
that ofs->indexdir in one function is the same dentry as ofs->workdir
in another function.

Remove this redundancy, by referencing ofs->workdir directly in index
helpers and by using the ovl_indexdir() accessor in generic code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# bc8df7a3 23-Aug-2023 Alexander Larsson <alexl@redhat.com>

ovl: Add an alternative type of whiteout

An xattr whiteout (called "xwhiteout" in the code) is a reguar file of
zero size with the "overlay.whiteout" xattr set. A file like this in a
directory with the "overlay.whiteouts" xattrs set will be treated the
same way as a regular whiteout.

The "overlay.whiteouts" directory xattr is used in order to
efficiently handle overlay checks in readdir(), as we only need to
checks xattrs in affected directories.

The advantage of this kind of whiteout is that they can be escaped
using the standard overlay xattr escaping mechanism. So, a file with a
"overlay.overlay.whiteout" xattr would be unescaped to
"overlay.whiteout", which could then be consumed by another overlayfs
as a whiteout.

Overlayfs itself doesn't create whiteouts like this, but a userspace
mechanism could use this alternative mechanism to convert images that
may contain whiteouts to be used with overlayfs.

To work as a whiteout for both regular overlayfs mounts as well as
userxattr mounts both the "user.overlay.whiteout*" and the
"trusted.overlay.whiteout*" xattrs will need to be created.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 3e327154 05-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

vfs: get rid of old '->iterate' directory operation

All users now just use '->iterate_shared()', which only takes the
directory inode lock for reading.

Filesystems that never got convered to shared mode now instead use a
wrapper that drops the lock, re-takes it in write mode, calls the old
function, and then downgrades the lock back to read mode.

This way the VFS layer and other callers no longer need to care about
filesystems that never got converted to the modern era.

The filesystems that use the new wrapper are ceph, coda, exfat, jfs,
ntfs, ocfs2, overlayfs, and vboxsf.

Honestly, several of them look like they really could just iterate their
directories in shared mode and skip the wrapper entirely, but the point
of this change is to not change semantics or fix filesystems that
haven't been fixed in the last 7+ years, but to finally get rid of the
dual iterators.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# dcb399de 24-May-2023 Amir Goldstein <amir73il@gmail.com>

ovl: pass ovl_fs to xino helpers

Internal ovl methods should use ovl_fs and not sb as much as
possible.

Use a constant_table to translate from enum xino mode to string
in preperation for new mount api option parsing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 1fa9c5c5 07-Oct-2022 Miklos Szeredi <mszeredi@redhat.com>

ovl: use inode instead of dentry where possible

Passing dentry to some helpers is unnecessary. Simplify these cases.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# af4dcb6d 04-Oct-2022 Amir Goldstein <amir73il@gmail.com>

ovl: use plain list filler in indexdir and workdir cleanup

Those two cleanup routines are using the helper ovl_dir_read() with the
merge dir filler, which populates an rb tree, that is never used.

The index dir entry names all have a long (42 bytes) constant prefix, so it
is not surprising that perf top has demostrated high CPU usage by rb tree
population during cleanup of a large index dir:

- 9.53% ovl_fill_merge
- 78.41% ovl_cache_entry_find_link.constprop.27
+ 72.11% strncmp

Use the plain list filler that does not populate the unneeded rb tree.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2d343087 04-Aug-2022 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: constify path

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 25885a35 16-Aug-2022 Al Viro <viro@zeniv.linux.org.uk>

Change calling conventions for filldir_t

filldir_t instances (directory iterators callbacks) used to return 0 for
"OK, keep going" or -E... for "stop". Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).

So let's just return bool ("should we keep going?") - it's less confusing
that way. The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
do something for everything in directory, until we run into problem
and
find an entry in directory and do something to it.

The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value did they get.

"true means stop" would be more natural for the second group; "true
means keep going" - for the first one. I tried both variants and
the things like
if allocation failed
something = -ENOMEM;
return true;
just looked unnatural and asking for trouble.

[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ba9ea771 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: handle idmappings for layer lookup

Make the two places where lookup helpers can be called either on lower
or upper layers take the mount's idmapping into account. To this end we
pass down the mount in struct ovl_lookup_data. It can later also be used
to construct struct path for various other helpers. This is needed to
support idmapped base layers with overlay.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 22f289ce 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: use ovl_lookup_upper() wrapper

Introduce ovl_lookup_upper() as a simple wrapper around lookup_one().
Make it clear in the helper's name that this only operates on the upper
layer. The wrapper will take upper layer's idmapping into account when
checking permission in lookup_one().

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 576bb263 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: pass ofs to creation operations

Pass down struct ovl_fs to all creation helpers so we can ultimately
retrieve the relevant upper mount and take the mount's idmapping into
account when creating new filesystem objects. This is needed to support
idmapped base layers with overlay.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c914c0e2 03-Apr-2022 Amir Goldstein <amir73il@gmail.com>

ovl: use wrappers to all vfs_*xattr() calls

Use helpers ovl_*xattr() to access user/trusted.overlay.* xattrs
and use helpers ovl_do_*xattr() to access generic xattrs. This is a
preparatory patch for using idmapped base layers with overlay.

Note that a few of those places called vfs_*xattr() calls directly to
reduce the amount of debug output. But as Miklos pointed out since
overlayfs has been stable for quite some time the debug output isn't all
that relevant anymore and the additional debug in all locations was
actually quite helpful when developing this patch series.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 9011c279 26-Apr-2021 Amir Goldstein <amir73il@gmail.com>

ovl: skip stale entries in merge dir cache iteration

On the first getdents call, ovl_iterate() populates the readdir cache
with a list of entries, but for upper entries with origin lower inode,
p->ino remains zero.

Following getdents calls traverse the readdir cache list and call
ovl_cache_update_ino() for entries with zero p->ino to lookup the entry
in the overlay and return d_ino that is consistent with st_ino.

If the upper file was unlinked between the first getdents call and the
getdents call that lists the file entry, ovl_cache_update_ino() will not
find the entry and fall back to setting d_ino to the upper real st_ino,
which is inconsistent with how this object was presented to users.

Instead of listing a stale entry with inconsistent d_ino, simply skip
the stale entry, which is better for users.

xfstest overlay/077 is failing without this patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 65cd913e 10-Apr-2021 Amir Goldstein <amir73il@gmail.com>

ovl: invalidate readdir cache on changes to dir with origin

The test in ovl_dentry_version_inc() was out-dated and did not include
the case where readdir cache is used on a non-merge dir that has origin
xattr, indicating that it may contain leftover whiteouts.

To make the code more robust, use the same helper ovl_dir_is_real()
to determine if readdir cache should be used and if readdir cache should
be invalidated.

Fixes: b79e05aaa166 ("ovl: no direct iteration for dir with origin xattr")
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxht70nODhNHNwGFMSqDyOKLXOKrY0H6g849os4BQ7cokA@mail.gmail.com/
Cc: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c4fe8aef 08-Apr-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: remove unneeded ioctls

The FS_IOC_[GS]ETFLAGS/FS_IOC_FS[GS]ETXATTR ioctls are now handled via the
fileattr api. The only unconverted filesystem remaining is CIFS and it is
not allowed to be overlayed due to case insensitive filenames.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 335d3fc5 07-Jan-2021 Sargun Dhillon <sargun@sargun.me>

ovl: implement volatile-specific fsync error behaviour

Overlayfs's volatile option allows the user to bypass all forced sync calls
to the upperdir filesystem. This comes at the cost of safety. We can never
ensure that the user's data is intact, but we can make a best effort to
expose whether or not the data is likely to be in a bad state.

The best way to handle this in the time being is that if an overlayfs's
upperdir experiences an error after a volatile mount occurs, that error
will be returned on fsync, fdatasync, sync, and syncfs. This is
contradictory to the traditional behaviour of VFS which fails the call
once, and only raises an error if a subsequent fsync error has occurred,
and been raised by the filesystem.

One awkward aspect of the patch is that we have to manually set the
superblock's errseq_t after the sync_fs callback as opposed to just
returning an error from syncfs. This is because the call chain looks
something like this:

sys_syncfs ->
sync_filesystem ->
__sync_filesystem ->
/* The return value is ignored here
sb->s_op->sync_fs(sb)
_sync_blockdev
/* Where the VFS fetches the error to raise to userspace */
errseq_check_and_advance

Because of this we call errseq_set every time the sync_fs callback occurs.
Due to the nature of this seen / unseen dichotomy, if the upperdir is an
inconsistent state at the initial mount time, overlayfs will refuse to
mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
will increment on error until the user calls syncfs.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: c86243b090bc ("ovl: provide a mount option "volatile"")
Cc: stable@vger.kernel.org
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b854cc65 04-Jan-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: avoid deadlock on directory ioctl

The function ovl_dir_real_file() currently uses the inode lock to serialize
writes to the od->upperfile field.

However, this function will get called by ovl_ioctl_set_flags(), which
utilizes the inode lock too. In this case ovl_dir_real_file() will try to
claim a lock that is owned by a function in its call stack, which won't get
released before ovl_dir_real_file() returns.

Fix by replacing the open coded compare and exchange by an explicit atomic
op.

Fixes: 61536bed2149 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories")
Cc: stable@vger.kernel.org # v5.10
Reported-by: Icenowy Zheng <icenowy@aosc.io>
Tested-by: Icenowy Zheng <icenowy@aosc.io>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 61536bed 29-Sep-2020 Amir Goldstein <amir73il@gmail.com>

ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories

[S|G]ETFLAGS and FS[S|G]ETXATTR ioctls are applicable to both files and
directories, so add ioctl operations to dir as well.

We teach ovl_real_fdget() to get the realfile of directories which use
a different type of file->private_data.

Ifdef away compat ioctl implementation to conform to standard practice.

With this change, xfstest generic/079 which tests these ioctls on files
and directories passes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 610afc0b 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: pass ovl_fs down to functions accessing private xattrs

This paves the way for optionally using the "user.overlay." xattr
namespace.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c86243b0 31-Aug-2020 Vivek Goyal <vgoyal@redhat.com>

ovl: provide a mount option "volatile"

Container folks are complaining that dnf/yum issues too many sync while
installing packages and this slows down the image build. Build requirement
is such that they don't care if a node goes down while build was still
going on. In that case, they will simply throw away unfinished layer and
start new build. So they don't care about syncing intermediate state to the
disk and hence don't want to pay the price associated with sync.

So they are asking for mount options where they can disable sync on overlay
mount point.

They primarily seem to have two use cases.

- For building images, they will mount overlay with nosync and then sync
upper layer after unmounting overlay and reuse upper as lower for next
layer.

- For running containers, they don't seem to care about syncing upper layer
because if node goes down, they will simply throw away upper layer and
create a fresh one.

So this patch provides a mount option "volatile" which disables all forms
of sync. Now it is caller's responsibility to throw away upper if system
crashes or shuts down and start fresh.

With "volatile", I am seeing roughly 20% speed up in my VM where I am just
installing emacs in an image. Installation time drops from 31 seconds to 25
seconds when nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take out the
network operations latency out of the measurement.

Giuseppe is also looking to cut down on number of iops done on the disk. He
is complaining that often in cloud their VMs are throttled if they cross
the limit. This option can help them where they reduce number of iops (by
cutting down on frequent sync and writebacks).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 235ce9ed 30-Aug-2020 Amir Goldstein <amir73il@gmail.com>

ovl: check for incompatible features in work dir

An incompatible feature is marked by a non-empty directory nested
2 levels deep under "work" dir, e.g.:
workdir/work/incompat/volatile.

This commit checks for marked incompat features, warns about them
and fails to mount the overlay, for example:
overlayfs: overlay with incompat feature 'volatile' cannot be mounted

Very old kernels (i.e. v3.18) will fail to remove a non-empty "work"
dir and fail the mount. Newer kernels will fail to remove a "work"
dir with entries nested 3 levels and fall back to read-only mount.

User mounting with old kernel will see a warning like these in dmesg:
overlayfs: cleanup of 'incompat/...' failed (-39)
overlayfs: cleanup of 'work/incompat' failed (-39)
overlayfs: cleanup of 'ovl-work/work' failed (-39)
overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17);
mounting read-only

These warnings should give the hint to the user that:
1. mount failure is caused by backward incompatible features
2. mount failure can be resolved by manually removing the "work" directory

There is nothing preventing users on old kernels from manually removing
workdir entirely or mounting overlay with a new workdir, so this is in
no way a full proof backward compatibility enforcement, but only a best
effort.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 08f4c7c8 04-Jun-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: add accessor for ofs->upper_mnt

Next patch will remove ofs->upper_mnt, so add an accessor function for this
field.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 48bd024b 02-Jun-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: switch to mounter creds in readdir

In preparation for more permission checking, override credentials for
directory operations on the underlying filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 130fdbc3 02-Jun-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: pass correct flags for opening real directory

The three instances of ovl_path_open() in overlayfs/readdir.c do three
different things:

- pass f_flags from overlay file
- pass O_RDONLY | O_DIRECTORY
- pass just O_RDONLY

The value of f_flags can be (other than O_RDONLY):

O_WRONLY - not possible for a directory
O_RDWR - not possible for a directory
O_CREAT - masked out by dentry_open()
O_EXCL - masked out by dentry_open()
O_NOCTTY - masked out by dentry_open()
O_TRUNC - masked out by dentry_open()
O_APPEND - no effect on directory ops
O_NDELAY - no effect on directory ops
O_NONBLOCK - no effect on directory ops
__O_SYNC - no effect on directory ops
O_DSYNC - no effect on directory ops
FASYNC - no effect on directory ops
O_DIRECT - no effect on directory ops
O_LARGEFILE - ?
O_DIRECTORY - only affects lookup
O_NOFOLLOW - only affects lookup
O_NOATIME - overlay sets this unconditionally in ovl_path_open()
O_CLOEXEC - only affects fd allocation
O_PATH - no effect on directory ops
__O_TMPFILE - not possible for a directory


Fon non-merge directories we use the underlying filesystem's iterate; in
this case honor O_LARGEFILE from the original file to make sure that open
doesn't get rejected.

For merge directories it's safe to pass O_LARGEFILE unconditionally since
userspace will only see the artificial offsets created by overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c21c839b 23-Apr-2020 Chengguang Xu <cgxu519@mykernel.net>

ovl: whiteout inode sharing

Share inode with different whiteout files for saving inode and speeding up
delete operation.

If EMLINK is encountered when linking a shared whiteout, create a new one.
In case of any other error, disable sharing for this super block.

Note: ofs->whiteout is protected by inode lock on workdir.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3011645b 02-Apr-2020 Amir Goldstein <amir73il@gmail.com>

ovl: cleanup non-empty directories in ovl_indexdir_cleanup()

Teach ovl_indexdir_cleanup() to remove temp directories containing
whiteouts to prepare for using index dir instead of work dir for removing
merge directories.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 926e94d7 21-Feb-2020 Amir Goldstein <amir73il@gmail.com>

ovl: enable xino automatically in more cases

So far, with xino=auto, we only enable xino if we know that all
underlying filesystem use 32bit inode numbers.

When users configure overlay with xino=auto, they already declare that
they are ready to handle 64bit inode number from overlay.

It is a very common case, that underlying filesystem uses 64bit ino,
but rarely or never uses the high inode number bits (e.g. tmpfs, xfs).
Leaving it for the users to declare high ino bits are unused with
xino=on is not a recipe for many users to enjoy the benefits of xino.

There appears to be very little reason not to enable xino when users
declare xino=auto even if we do not know how many bits underlying
filesystem uses for inode numbers.

In the worst case of xino bits overflow by real inode number, we
already fall back to the non-xino behavior - real inode number with
unique pseudo dev or to non persistent inode number and overlay st_dev
(for directories).

The only annoyance from auto enabling xino is that xino bits overflow
emits a warning to kmsg. Suppress those warnings unless users explicitly
asked for xino=on, suggesting that they expected high ino bits to be
unused by underlying filesystem.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# dfe51d47 21-Feb-2020 Amir Goldstein <amir73il@gmail.com>

ovl: avoid possible inode number collisions with xino=on

When xino feature is enabled and a real directory inode number overflows
the lower xino bits, we cannot map this directory inode number to a unique
and persistent inode number and we fall back to the real inode st_ino and
overlay st_dev.

The real inode st_ino with high bits may collide with a lower inode number
on overlay st_dev that was mapped using xino.

To avoid possible collision with legitimate xino values, map a non
persistent inode number to a dedicated range in the xino address space.
The dedicated range is created by adding one more bit to the number of
reserved high xino bits. We could have added just one more fsid, but that
would have had the undesired effect of changing persistent overlay inode
numbers on kernel or require more complex xino mapping code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 13464165 24-Jan-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: layer is const

The ovl_layer struct is never modified except at initialization.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0f831ec8 16-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: simplify ovl_same_sb() helper

No code uses the sb returned from this helper, so make it retrun a boolean
and rename it to ovl_same_fs().

The xino mode is irrelevant when all layers are on same fs, so instead of
describing samefs with mode OVL_XINO_OFF, use a new xino_mode state, which
is 0 in the case of samefs, -1 in the case of xino=off and > 0 with xino
enabled.

Create a new helper ovl_same_dev(), to use instead of the common check for
(ovl_same_fs() || xinobits).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1bd0a3ae 16-Dec-2019 lijiazi <jqqlijiazi@gmail.com>

ovl: use pr_fmt auto generate prefix

Use pr_fmt auto generate "overlayfs: " prefix.

Signed-off-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4c37e71b 22-Dec-2019 Amir Goldstein <amir73il@gmail.com>

ovl: fix wrong WARN_ON() in ovl_cache_update_ino()

The WARN_ON() that child entry is always on overlay st_dev became wrong
when we allowed this function to update d_ino in non-samefs setup with xino
enabled.

It is not true in case of xino bits overflow on a non-dir inode. Leave the
WARN_ON() only for directories, where assertion is still true.

Fixes: adbf4f7ea834 ("ovl: consistent d_ino for non-samefs with xino")
Cc: <stable@vger.kernel.org> # v4.17+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 67810693 17-Jul-2018 Amir Goldstein <amir73il@gmail.com>

ovl: fix wrong use of impure dir cache in ovl_iterate()

Only upper dir can be impure, but if we are in the middle of
iterating a lower real dir, dir could be copied up and marked
impure. We only want the impure cache if we started iterating
a real upper dir to begin with.

Aditya Kali reported that the following reproducer hits the
WARN_ON(!cache->refcount) in ovl_get_cache():

docker run --rm drupal:8.5.4-fpm-alpine \
sh -c 'cd /var/www/html/vendor/symfony && \
chown -R www-data:www-data . && ls -l .'

Reported-by: Aditya Kali <adityakali@google.com>
Tested-by: Aditya Kali <adityakali@google.com>
Fixes: 4edb83bb1041 ('ovl: constant d_ino for non-merge dirs')
Cc: <stable@vger.kernel.org> # v4.14
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# adbf4f7e 06-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: consistent d_ino for non-samefs with xino

When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.

In that case, relax non-samefs constraint for consistent d_ino and always
iterate non-merge dir using ovl_fill_real() actor so we can remap lower
inode numbers to unique lower fs range.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 24f0b172 11-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: whiteout orphan index entries on mount

Orphan index entries are non-dir index entries whose union nlink count
dropped to zero. With index=on, orphan index entries are removed on
mount. With NFS export feature enabled, orphan index entries are replaced
with white out index entries to block future open by handle from opening
the lower file.

When dir index has a stale 'upper' xattr, we assume that the upper dir
was removed and we treat the dir index as orphan entry that needs to be
whited out or removed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1eff1a1d 12-Dec-2017 Amir Goldstein <amir73il@gmail.com>

ovl: simplify arguments to ovl_check_origin_fh()

Pass the fs instance with lower_layers array instead of the dentry
lowerstack array to ovl_check_origin_fh(), because the dentry members
of lowerstack play no role in this helper.

This change simplifies the argument list of ovl_check_origin(),
ovl_cleanup_index() and ovl_verify_index().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a5a927a7 03-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: take mnt_want_write() for removing impure xattr

The optimization in ovl_cache_get_impure() that tries to remove an
unneeded "impure" xattr needs to take mnt_want_write() on upper fs.

Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs")
Cc: <stable@vger.kernel.org> #v4.14
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 6d0a8a90 10-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: take lower dir inode mutex outside upper sb_writers lock

The functions ovl_lower_positive() and ovl_check_empty_dir() both take
inode mutex on the real lower dir under ovl_want_write() which takes
the upper_mnt sb_writers lock.

While this is not a clear locking order or layering violation, it creates
an undesired lock dependency between two unrelated layers for no good
reason.

This lock dependency materializes to a false(?) positive lockdep warning
when calling rmdir() on a nested overlayfs, where both nested and
underlying overlayfs both use the same fs type as upper layer.

rmdir() on the nested overlayfs creates the lock chain:
sb_writers of upper_mnt (e.g. tmpfs) in ovl_do_remove()
ovl_i_mutex_dir_key[] of lower overlay dir in ovl_lower_positive()

rmdir() on the underlying overlayfs creates the lock chain in
reverse order:
ovl_i_mutex_dir_key[] of lower overlay dir in vfs_rmdir()
sb_writers of nested upper_mnt (e.g. tmpfs) in ovl_do_remove()

To rid of the unneeded locking dependency, move both ovl_lower_positive()
and ovl_check_empty_dir() to before ovl_want_write() in rmdir() and
rename() implementation.

This change spreads the pieces of ovl_check_empty_and_clear() directly
inside the rmdir()/rename() implementations so the helper is no longer
needed and removed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d796e77f 08-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix failure to fsync lower dir

As a writable mount, it is not expected for overlayfs to return
EINVAL/EROFS for fsync, even if dir/file is not changed.

This commit fixes the case of fsync of directory, which is easier to
address, because overlayfs already implements fsync file operation for
directories.

The problem reported by Raphael is that new PostgreSQL 10.0 with a
database in overlayfs where lower layer in squashfs fails to start.
The failure is due to fsync error, when PostgreSQL does fsync on all
existing db directories on startup and a specific directory exists
lower layer with no changes.

Reported-by: Raphael Hertzog <raphael@ouaza.com>
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Raphaƫl Hertzog <hertzog@debian.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3382290e 24-Oct-2017 Will Deacon <will@kernel.org>

locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

[ Note, this is a Git cherry-pick of the following commit:

506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

... for easier x86 PTI code testing and back-porting. ]

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# da2e6b7e 22-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix overlay: warning prefix

Conform two stray warning messages to the standard overlayfs: prefix.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b02a16e6 28-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: update ctx->pos on impure dir iteration

This fixes a regression with readdir of impure dir in overlayfs
that is shared to VM via 9p fs.

Reported-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs")
Cc: <stable@vger.kernel.org> #4.14
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b9343632 24-Jul-2017 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ovl: re-structure overlay lower layers in-memory

Define new structures to represent overlay instance lower layers and
overlay merge dir lower layers to make room for storing more per layer
information in-memory.

Instead of keeping the fs instance lower layers in an array of struct
vfsmount, keep them in an array of new struct ovl_layer, that has a
pointer to struct vfsmount.

Instead of keeping the dentry lower layers in an array of struct path,
keep them in an array of new struct ovl_path, that has a pointer to
struct dentry and to struct ovl_layer.

Add a small helper to find the fs layer id that correspopnds to a lower
struct ovl_path and use it in ovl_lookup().

[amir: split re-structure from anonymous bdev patch]

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 95e598e7 31-Oct-2017 zhangyi (F) <yi.zhang@huawei.com>

ovl: simplify ovl_check_empty_and_clear()

Filter out non-whiteout non-upper entries from list of merge dir entries
while checking if merge dir is empty in ovl_check_empty_dir().
The remaining work for ovl_clear_empty() is to clear all entries on the
list.

[amir: split patch from rmdir bug fix]

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b79e05aa 25-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: no direct iteration for dir with origin xattr

If a non-merge dir in an overlay mount has an overlay.origin xattr, it
means it was once an upper merge dir, which may contain whiteouts and
then the lower dir was removed under it.

Do not iterate real dir directly in this case to avoid exposing whiteouts.

[SzM] Set OVL_WHITEOUT for all merge directories as well.

[amir] A directory that was just copied up does not have the OVL_WHITEOUTS
flag. We need to set it to fix merge dir iteration.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fa0096e3 23-Oct-2017 Amir Goldstein <amir73il@gmail.com>

ovl: do not cleanup unsupported index entries

With index=on, ovl_indexdir_cleanup() tries to cleanup invalid index
entries (e.g. bad index name). This behavior could result in cleaning of
entries created by newer kernels and is therefore undesirable.
Instead, abort mount if such entries are encountered. We still cleanup
'stale' entries and 'orphan' entries, both those cases can be a result
of offline changes to lower and upper dirs.

When encoutering an index entry of type directory or whiteout, kernel
was supposed to fallback to read-only mount, but the fill_super()
operation returns EROFS in this case instead of returning success with
read-only mount flag, so mount fails when encoutering directory or
whiteout index entries. Bless this behavior by returning -EINVAL on
directory and whiteout index entries as we do for all unsupported index
entries.

Fixes: 61b674710cd9 ("ovl: do not cleanup directory and whiteout index..")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 506458ef 24-Oct-2017 Will Deacon <will@kernel.org>

locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# dc7ab677 24-Sep-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix dentry leak in ovl_indexdir_cleanup()

index dentry was not released when breaking out of the loop
due to index verification error.

Fixes: 415543d5c64f ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ff7a5fb0 07-Jun-2017 Peter Zijlstra <peterz@infradead.org>

overlayfs, locking: Remove smp_mb__before_spinlock() usage

While we could replace the smp_mb__before_spinlock() with the new
smp_mb__after_spinlock(), the normal pattern is to use
smp_store_release() to publish an object that is used for
lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
/ rcu_dereference() patterns.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4edb83bb 27-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: constant d_ino for non-merge dirs

Impure directories are ones which contain objects with origins (i.e. those
that have been copied up). These are relevant to readdir operation only
because of the d_ino field, no other transformation is necessary. Also a
directory can become impure between two getdents(2) calls.

This patch creates a cache for impure directories. Unlike the cache for
merged directories, this one only contains entries with origin and is not
refcounted but has a its lifetime tied to that of the dentry.

Similarly to the merged cache, the impure cache is invalidated based on a
version number. This version number is incremented when an entry with
origin is added or removed from the directory.

If the cache is empty, then the impure xattr is removed from the directory.

This patch also fixes up handling of d_ino for the ".." entry if the parent
directory is merged.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b5efccbe 11-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: constant d_ino across copy up

When all layers are on the same fs, and iterating a directory which may
contain copy up entries, call vfs_getattr() on the overlay entries to make
sure that d_ino will be consistent with st_ino from stat(2).

There is an overhead of lookup per upper entry in readdir.

The overhead is minimal if the iterated entries are already in dcache. It
is also quite useful for the common case of 'ls -l' that readdir() pre
populates the dcache with the listed entries, making the following stat()
calls faster.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 31e8ccea 27-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: fix readdir error value

actor's return value is taken as a bool (filled/not filled) so we need to
return the error in the context.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 61b67471 18-Jul-2017 Amir Goldstein <amir73il@gmail.com>

ovl: do not cleanup directory and whiteout index entries

Directory index entries are going to be used for looking up
redirected upper dirs by lower dir fh when decoding an overlay
file handle of a merge dir.

Whiteout index entries are going to be used as an indication that
an exported overlay file handle should be treated as stale (i.e.
after unlink of the overlay inode).

We don't know the verification rules for directory and whiteout
index entries, because they have not been implemented yet, so fail
to mount overlay rw if those entries are found to avoid corrupting
an index that was created by a newer kernel.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 415543d5 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: cleanup bad and stale index entries on mount

Bad index entries are entries whose name does not match the
origin file handle stored in trusted.overlay.origin xattr.
Bad index entries could be a result of a system power off in
the middle of copy up.

Stale index entries are entries whose origin file handle is
stale. Stale index entries could be a result of copying layers
or removing lower entries while the overlay is not mounted.
The case of copying layers should be detected earlier by the
verification of upper root dir origin and index dir origin.

Both bad and stale index entries are detected and removed
on mount.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# eea2fb48 01-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: proper cleanup of workdir

When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.

Previously the mount code removed this directory if it already existed and
created a new one. If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.

While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.

In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.

This patch implements cleaning out any files left in workdir. It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>


# 3fe6e52f 07-Apr-2016 Antonio Murdaca <amurdaca@redhat.com>

ovl: override creds with the ones from the superblock mounter

In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*

This ends up failing with -EPERM after the files in dir has been
correctly deleted:

unlinkat(4, "2", 0) = 0
unlinkat(4, "1", 0) = 0
unlinkat(4, "3", 0) = 0
close(4) = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace. The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 00235411 25-May-2016 Al Viro <viro@zeniv.linux.org.uk>

restore killability of old mutex_lock_killable(&inode->i_mutex) users

The ones that are taking it exclusive, that is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9902af79 15-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

parallel lookups: actual switch to rwsem

ta-da!

The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.

lockdep side also might need more work

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 56656e96 21-Mar-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: rename is_merge to is_lowest

The 'is_merge' is an historical naming from when only a single lower layer
could exist. With the introduction of multiple lower layers the meaning of
this flag was changed to mean only the "lowest layer" (while all lower
layers were being merged).

So now 'is_merge' is inaccurate and hence renaming to 'is_lowest'

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 45aebeaf 22-Feb-2016 Vivek Goyal <vgoyal@redhat.com>

ovl: Ensure upper filesystem supports d_type

In some instances xfs has been created with ftype=0 and there if a file
on lower fs is removed, overlay leaves a whiteout in upper fs but that
whiteout does not get filtered out and is visible to overlayfs users.

And reason it does not get filtered out because upper filesystem does
not report file type of whiteout as DT_CHR during iterate_dir().

So it seems to be a requirement that upper filesystem support d_type for
overlayfs to work properly. Do this check during mount and fail if d_type
is not supported.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 84889d49 16-Nov-2015 Konstantin Khlebnikov <koct9i@gmail.com>

ovl: check dentry positiveness in ovl_cleanup_whiteouts()

This patch fixes kernel crash at removing directory which contains
whiteouts from lower layers.

Cache of directory content passed as "list" contains entries from all
layers, including whiteouts from lower layers. So, lookup in upper dir
(moved into work at this stage) will return negative entry. Plus this
cache is filled long before and we can race with external removal.

Example:
mkdir -p lower0/dir lower1/dir upper work overlay
touch lower0/dir/a lower0/dir/b
mknod lower1/dir/a c 0 0
mount -t overlay none overlay -o lowerdir=lower1:lower0,upperdir=upper,workdir=work
rm -fr overlay/dir

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # 3.18+


# cdb67279 22-Jun-2015 Miklos Szeredi <mszeredi@suse.cz>

ovl: lookup whiteouts outside iterate_dir()

If jffs2 can deadlock on overlayfs readdir because it takes the same lock
on ->iterate() as in ->lookup().

Fix by moving whiteout checking outside iterate_dir(). Optimized by
collecting potential whiteouts (DT_CHR) in a temporary list and if
non-empty iterating throug these and checking for a 0/0 chardev.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 49c21e1cacd7 ("ovl: check whiteout while reading directory")
Reported-by: Roman Yeryomin <leroi.lists@gmail.com>


# 4330397e 10-Dec-2014 hujianyang <hujianyang@huawei.com>

ovl: discard independent cursor in readdir()

Since the ovl_dir_cache is stable during a directory reading, the cursor
of struct ovl_dir_file don't need to be an independent entry in the list
of a merged directory.

This patch changes *cursor* to a pointer which points to the entry in the
ovl_dir_cache. After this, we don't need to check *is_cursor* either.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 3e01cee3 12-Dec-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: check whiteout on lowest layer as well

Not checking whiteouts on lowest layer was an optimization (there's nothing
to white out there), but it could result in inconsitent behavior when a
layer previously used as upper/middle is later used as lowest.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 9d7459d8 12-Dec-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: multi-layer readdir

If multiple lower layers exist, merge them as well in readdir according to
the same rules as merging upper with lower. I.e. take whiteouts and opaque
directories into account on all but the lowers layer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 1afaba1e 12-Dec-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: make path-type a bitmap

OVL_PATH_PURE_UPPER -> __OVL_PATH_UPPER | __OVL_PATH_PURE
OVL_PATH_UPPER -> __OVL_PATH_UPPER
OVL_PATH_MERGE -> __OVL_PATH_UPPER | __OVL_PATH_MERGE
OVL_PATH_LOWER -> 0

Multiple R/O layers will allow __OVL_PATH_MERGE without __OVL_PATH_UPPER.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 49c21e1c 12-Dec-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: check whiteout while reading directory

Don't make a separate pass for checking whiteouts, since we can do it while
reading the upper directory.

This will make it easier to handle multiple layers.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 7676895f 20-Nov-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: ovl_dir_fsync() cleanup

Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE. For a copied up
directory the two are currently equivalent.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# c9f00fdb 20-Nov-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: pass dentry into ovl_dir_read_merged()

Pass dentry into ovl_dir_read_merged() insted of upperpath and lowerpath.
This cleans up callers and paves the way for multi-layer directory reads.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 3f822c62 04-Nov-2014 Miklos Szeredi <miklos@szeredi.hu>

ovl: don't poison cursor

ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be
rebuilt. We did list_del() on the cursor, which results in an Oops on the
poisoned pointer in ovl_seek_cursor().

Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ac7576f4 30-Oct-2014 Miklos Szeredi <miklos@szeredi.hu>

vfs: make first argument of dir_context.actor typed

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9f2f7d4c 31-Oct-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: initialize ->is_cursor

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d1b72cc6 27-Oct-2014 Miklos Szeredi <miklos@szeredi.hu>

overlayfs: fix lockdep misannotation

In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:

touch /mnt/a/empty102/x
unlink /mnt/a/empty102/x
rmdir /mnt/a/empty102

It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c2096537 27-Oct-2014 Miklos Szeredi <miklos@szeredi.hu>

ovl: fix check for cursor

ovl_cache_entry.name is now an array not a pointer, so it makes no sense
test for it being NULL.

Detected by coverity.

From: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 68bf8611076a ("overlayfs: make ovl_cache_entry->name an array instead of
+pointer")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d45f00ae 28-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: barriers for opening upper-layer directory

make sure that
a) all stores done by opening struct file don't leak past storing
the reference in od->upperfile
b) the lockless side has read dependency barrier

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# db6ec212 23-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: embed middle into overlay_readdir_data

same story...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 49be4fb9 23-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: embed root into overlay_readdir_data

no sense having it a pointer - all instances have it pointing to
local variable in the same stack frame

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 68bf8611 23-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: make ovl_cache_entry->name an array instead of pointer

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3d268c9b 23-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: don't hold ->i_mutex over opening the real directory

just use it to serialize the assignment

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e9be9d5e 23-Oct-2014 Miklos Szeredi <mszeredi@suse.cz>

overlay filesystem

Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay

The following cotributions have been folded into this patch:

Neil Brown <neilb@suse.de>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <nbd@openwrt.org>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts

Erez Zadok <ezk@fsl.cs.sunysb.edu>
- fix cleanup after WARN_ON

Sedat Dilek <sedat.dilek@googlemail.com>
- fix up permission to confirm to new API

Robin Dong <hao.bigrat@gmail.com>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link

Andy Whitcroft <apw@canonical.com>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode

AV:
- ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
- ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
lack of check for udir being the parent of upper, dropping and regaining
the lock on udir (which would require _another_ check for parent being
right).
- bogus d_drop() in copyup and rename [fix from your mail]
- copyup/remove and copyup/rename races [fix from your mail]
- ovl_dir_fsync() leaving ERR_PTR() in ->realfile
- ovl_entry_free() is pointless - it's just a kfree_rcu()
- fold ovl_do_lookup() into ovl_lookup()
- manually assigning ->d_op is wrong. Just use ->s_d_op.
[patches picked from Miklos]:
* copyup/remove and copyup/rename races
* bogus d_drop() in copyup and rename

Also thanks to the following people for testing and reporting bugs:

Jordi Pujol <jordipujolp@gmail.com>
Andy Whitcroft <apw@canonical.com>
Michal Suchanek <hramrach@centrum.cz>
Felix Fietkau <nbd@openwrt.org>
Erez Zadok <ezk@fsl.cs.sunysb.edu>
Randy Dunlap <rdunlap@xenotime.net>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>