History log of /linux-master/fs/overlayfs/ovl_entry.h
Revision Date Author Comments
# 420332b9 19-Jan-2024 Amir Goldstein <amir73il@gmail.com>

ovl: mark xwhiteouts directory with overlay.opaque='x'

An opaque directory cannot have xwhiteouts, so instead of marking an
xwhiteouts directory with a new xattr, overload overlay.opaque xattr
for marking both opaque dir ('y') and xwhiteouts dir ('x').

This is more efficient as the overlay.opaque xattr is checked during
lookup of directory anyway.

This also prevents unnecessary checking the xattr when reading a
directory without xwhiteouts, i.e. most of the time.

Note that the xwhiteouts marker is not checked on the upper layer and
on the last layer in lowerstack, where xwhiteouts are not expected.

Fixes: bc8df7a3dc03 ("ovl: Add an alternative type of whiteout")
Cc: <stable@vger.kernel.org> # v6.7
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Tested-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 02d70090 19-Nov-2023 Amir Goldstein <amir73il@gmail.com>

ovl: remove redundant ofs->indexdir member

When the index feature is disabled, ofs->indexdir is NULL.
When the index feature is enabled, ofs->indexdir has the same value as
ofs->workdir and takes an extra reference.

This makes the code harder to understand when it is not always clear
that ofs->indexdir in one function is the same dentry as ofs->workdir
in another function.

Remove this redundancy, by referencing ofs->workdir directly in index
helpers and by using the ovl_indexdir() accessor in generic code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# a535116d 02-Oct-2023 Amir Goldstein <amir73il@gmail.com>

ovl: make use of ->layers safe in rcu pathwalk

ovl_permission() accesses ->layers[...].mnt; we can't have ->layers
freed without an RCU delay on fs shutdown.

Fortunately, kern_unmount_array() that is used to drop those mounts
does include an RCU delay, so freeing is delayed; unfortunately, the
array passed to kern_unmount_array() is formed by mangling ->layers
contents and that happens without any delays.

The ->layers[...].name string entries are used to store the strings to
display in "lowerdir=..." by ovl_show_options(). Those entries are not
accessed in RCU walk.

Move the name strings into a separate array ofs->config.lowerdirs and
reuse the ofs->config.lowerdirs array as the temporary mount array to
pass to kern_unmount_array().

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20231002023711.GP3389589@ZenIV/
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# adcd459f 21-May-2023 Andrea Righi <andrea.righi@canonical.com>

ovl: validate superblock in OVL_FS()

When CONFIG_OVERLAY_FS_DEBUG is enabled add an explicit check to make
sure that OVL_FS() is always used with a valid overlayfs superblock.
Otherwise trigger a WARN_ON_ONCE().

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# f01d0889 21-May-2023 Andrea Righi <andrea.righi@canonical.com>

ovl: make consistent use of OVL_FS()

Always use OVL_FS() to retrieve the corresponding struct ovl_fs from a
struct super_block.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# b0504bfe 26-Jun-2023 Amir Goldstein <amir73il@gmail.com>

ovl: add support for unique fsid per instance

The legacy behavior of ovl_statfs() reports the f_fsid filled by
underlying upper fs. This fsid is not unique among overlayfs instances
on the same upper fs.

With mount option uuid=on, generate a non-persistent uuid per overlayfs
instance and use it as the seed for f_fsid, similar to tmpfs.

This is useful for reporting fanotify events with fid info from different
instances of overlayfs over the same upper fs.

The old behavior of null uuid and upper fs fsid is retained with the
mount option uuid=null, which is the default.

The mount option uuid=off that disables uuid checks in underlying layers
also retains the legacy behavior.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 16aac5ad 23-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: support encoding non-decodable file handles

When all layers support file handles, we support encoding non-decodable
file handles (a.k.a. fid) even with nfs_export=off.

When file handles do not need to be decoded, we do not need to copy up
redirected lower directories on encode, and we encode also non-indexed
upper with lower file handle, so fid will not change on copy up.

This enables reporting fanotify events with file handles on overlayfs
with default config/mount options.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# ae8cba40 19-Apr-2023 Alexander Larsson <alexl@redhat.com>

ovl: Add framework for verity support

This adds the scaffolding (docs, config, mount options) for supporting
the new digest field in the metacopy xattr. This contains a fs-verity
digest that need to match the fs-verity digest of the lowerdata
file. The mount option "verity" specifies how this xattr is handled.

If you enable verity ("verity=on") all existing xattrs are validated
before use, and during metacopy we generate verity xattr in the upper
metacopy file (if the source file has verity enabled). This means
later accesses can guarantee that the same data is used.

Additionally you can use "verity=require". In this mode all metacopy
files must have a valid verity xattr. For this to work metadata
copy-up must be able to create a verity xattr (so that later accesses
are validated). Therefore, in this mode, if the lower data file
doesn't have fs-verity enabled we fall back to a full copy rather than
a metacopy.

Actual implementation follows in a separate commit.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# b36a5780 16-Jun-2023 Christian Brauner <brauner@kernel.org>

ovl: modify layer parameter parsing

We ran into issues where mount(8) passed multiple lower layers as one
big string through fsconfig(). But the fsconfig() FSCONFIG_SET_STRING
option is limited to 256 bytes in strndup_user(). While this would be
fixable by extending the fsconfig() buffer I'd rather encourage users to
append layers via multiple fsconfig() calls as the interface allows
nicely for this. This has also been requested as a feature before.

With this port to the new mount api the following will be possible:

fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/lower1", 0);

/* set upper layer */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "upperdir", "/upper", 0);

/* append "/lower2", "/lower3", and "/lower4" */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower2:/lower3:/lower4", 0);

/* turn index feature on */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "index", "on", 0);

/* append "/lower5" */
fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", ":/lower5", 0);

Specifying ':' would have been rejected so this isn't a regression. And
we can't simply use "lowerdir=/lower" to append on top of existing
layers as "lowerdir=/lower,lowerdir=/other-lower" would make
"/other-lower" the only lower layer so we'd break uapi if we changed
this. So the ':' prefix seems a good compromise.

Users can choose to specify multiple layers at once or individual
layers. A layer is appended if it starts with ":". This requires that
the user has already added at least one layer before. If lowerdir is
specified again without a leading ":" then all previous layers are
dropped and replaced with the new layers. If lowerdir is specified and
empty than all layers are simply dropped.

An additional change is that overlayfs will now parse and resolve layers
right when they are specified in fsconfig() instead of deferring until
super block creation. This allows users to receive early errors.

It also allows users to actually use up to 500 layers something which
was theoretically possible but ended up not working due to the mount
option string passed via mount(2) being too large.

This also allows a more privileged process to set config options for a
lesser privileged process as the creds for fsconfig() and the creds for
fsopen() can differ. We could restrict that they match by enforcing that
the creds of fsopen() and fsconfig() match but I don't see why that
needs to be the case and allows for a good delegation mechanism.

Plus, in the future it means we're able to extend overlayfs mount
options and allow users to specify layers via file descriptors instead
of paths:

fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower1", dirfd);

/* append */
fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower2", dirfd);

/* append */
fsconfig(FSCONFIG_SET_PATH{_EMPTY}, "lowerdir", "lower3", dirfd);

/* clear all layers specified until now */
fsconfig(FSCONFIG_SET_STRING, "lowerdir", NULL, 0);

This would be especially nice if users create an overlayfs mount on top
of idmapped layers or just in general private mounts created via
open_tree(OPEN_TREE_CLONE). Those mounts would then never have to appear
anywhere in the filesystem. But for now just do the minimal thing.

We should probably aim to move more validation into ovl_fs_parse_param()
so users get errors before fsconfig(FSCONFIG_CMD_CREATE). But that can
be done in additional patches later.

This is now also rebased on top of the lazy lowerdata lookup which
allows the specificatin of data only layers using the new "::" syntax.

The rules are simple. A data only layers cannot be followed by any
regular layers and data layers must be preceeded by at least one regular
layer.

Parsing the lowerdir mount option must change because of this. The
original patchset used the old lowerdir parsing function to split a
lowerdir mount option string such as:

lowerdir=/lower1:/lower2::/lower3::/lower4

simply replacing each non-escaped ":" by "\0". So sequences of
non-escaped ":" were counted as layers. For example, the previous
lowerdir mount option above would've counted 6 layers instead of 4 and a
lowerdir mount option such as:

lowerdir="/lower1:/lower2::/lower3::/lower4:::::::::::::::::::::::::::"

would be counted as 33 layers. Other than being ugly this didn't matter
much because kern_path() would reject the first "\0" layer. However,
this overcounting of layers becomes problematic when we base allocations
on it where we very much only want to allocate space for 4 layers
instead of 33.

So the new parsing function rejects non-escaped sequences of colons
other than ":" and "::" immediately instead of relying on kern_path().

Link: https://github.com/util-linux/util-linux/issues/2287
Link: https://github.com/util-linux/util-linux/issues/1992
Link: https://bugs.archlinux.org/task/78702
Link: https://lore.kernel.org/linux-unionfs/20230530-klagen-zudem-32c0908c2108@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# af5f2396 17-Jun-2023 Amir Goldstein <amir73il@gmail.com>

ovl: store enum redirect_mode in config instead of a string

Do all the logic to set the mode during mount options parsing and
do not keep the option string around.

Use a constant_table to translate from enum redirect mode to string
in preperation for new mount api option parsing.

The mount option "off" is translated to either "follow" or "nofollow",
depending on the "redirect_always_follow" build/module config, so
in effect, there are only three possible redirect modes.

This results in a minor change to the string that is displayed
in show_options() - when redirect_dir is enabled by default and the user
mounts with the option "redirect_dir=off", instead of displaying the mode
"redirect_dir=off" in show_options(), the displayed mode will be either
"redirect_dir=follow" or "redirect_dir=nofollow", depending on the value
of "redirect_always_follow" build/module config.

The displayed mode reflects the effective mode, so mounting overlayfs
again with the dispalyed redirect_dir option will result with the same
effective and displayed mode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# e4599d4b 17-Jun-2023 Amir Goldstein <amir73il@gmail.com>

ovl: negate the ofs->share_whiteout boolean

The default common case is that whiteout sharing is enabled.
Change to storing the negated no_shared_whiteout state, so we will not
need to initialize it.

This is the first step towards removing all config and feature
initializations out of ovl_fill_super().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# f723edb8 13-Jun-2023 Christian Brauner <brauner@kernel.org>

ovl: check type and offset of struct vfsmount in ovl_entry

Porting overlayfs to the new amount api I started experiencing random
crashes that couldn't be explained easily. So after much debugging and
reasoning it became clear that struct ovl_entry requires the point to
struct vfsmount to be the first member and of type struct vfsmount.

During the port I added a new member at the beginning of struct
ovl_entry which broke all over the place in the form of random crashes
and cache corruptions. While there's a comment in ovl_free_fs() to the
effect of "Hack! Reuse ofs->layers as a vfsmount array before freeing
it" there's no such comment on struct ovl_entry which makes this easy to
trip over.

Add a comment and two static asserts for both the offset and the type of
pointer in struct ovl_entry.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 42dd69ae 27-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: implement lazy lookup of lowerdata in data-only layers

Defer lookup of lowerdata in the data-only layers to first data access
or before copy up.

We perform lowerdata lookup before copy up even if copy up is metadata
only copy up. We can further optimize this lookup later if needed.

We do best effort lazy lookup of lowerdata for d_real_inode(), because
this interface does not expect errors. The only current in-tree caller
of d_real_inode() is trace_uprobe and this caller is likely going to be
followed reading from the file, before placing uprobes on offset within
the file, so lowerdata should be available when setting the uprobe.

Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2b21da92 26-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: prepare to store lowerdata redirect for lazy lowerdata lookup

Prepare to allow ovl_lookup() to leave the last entry in a non-dir
lowerstack empty to signify lazy lowerdata lookup.

In this case, ovl_lookup() stores the redirect path from metacopy to
lowerdata in ovl_inode, which is going to be used later to perform the
lazy lowerdata lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 37ebf056 26-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: introduce data-only lower layers

Introduce the format lowerdir=lower1:lower2::lowerdata1::lowerdata2
where the lower layers on the right of the :: separators are not merged
into the overlayfs merge dirs.

Data-only lower layers are only allowed at the bottom of the stack.

The files in those layers are only meant to be accessible via absolute
redirect from metacopy files in lower layers. Following changes will
implement lookup in the data layers.

This feature was requested for composefs ostree use case, where the
lower data layer should only be accessiable via absolute redirects
from metacopy inodes.

The lower data layers are not required to a have a unique uuid or any
uuid at all, because they are never used to compose the overlayfs inode
st_ino/st_dev.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ab1eb5ff 01-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: deduplicate lowerdata and lowerstack[]

The ovl_inode contains a copy of lowerdata in lowerstack[], so the
lowerdata inode member can be removed.

Use accessors ovl_lowerdata*() to get the lowerdata whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ac900ed4 01-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: deduplicate lowerpath and lowerstack[]

The ovl_inode contains a copy of lowerpath in lowerstack[0], so the
lowerpath member can be removed.

Use accessor ovl_lowerpath() to get the lowerpath whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0af950f5 07-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: move ovl_entry into ovl_inode

The lower stacks of all the ovl inode aliases should be identical
and there is redundant information in ovl_entry and ovl_inode.

Move lowerstack into ovl_inode and keep only the OVL_E_FLAGS
per overlay dentry.

Following patches will deduplicate redundant ovl_inode fields.

Note that for pure upper and negative dentries, OVL_E(dentry) may be
NULL now, so it is imporatnt to use the ovl_numlower() accessor.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 163db0da 03-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: factor out ovl_free_entry() and ovl_stack_*() helpers

In preparation for moving lowerstack into ovl_inode.

Note that in ovl_lookup() the temp stack dentry refs are now cloned
into the final ovl_lowerstack instead of being transferred, so cleanup
always needs to call ovl_stack_free(stack).

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5522c9c7 03-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: use ovl_numlower() and ovl_lowerstack() accessors

This helps fortify against dereferencing a NULL ovl_entry,
before we move the ovl_entry reference into ovl_inode.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a6ff2bc0 14-Mar-2023 Amir Goldstein <amir73il@gmail.com>

ovl: use OVL_E() and OVL_E_FLAGS() accessors

Instead of open coded instances, because we are about to split
the two apart.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# abf08576 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port vfs_*() helpers to struct mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# bc70682a 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: support idmapped layers

Now that overlay is able to take a layers idmapping into account allow
overlay mounts to be created on top of idmapped mounts.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ffa5723c 03-Apr-2022 Amir Goldstein <amir73il@gmail.com>

ovl: store lower path in ovl_inode

Create some ovl_i_* helpers to get real path from ovl inode. Instead of
just stashing struct inode for the lower layer we stash struct path for
the lower layer. The helpers allow to retrieve a struct path for the
relevant upper or lower layer. This will be used when retrieving
information based on struct inode when copying up inode attributes from
upper or lower inodes to ovl inodes and when checking permissions in
ovl_permission() in following patches. This is needed to support
idmapped base layers with overlay.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b65c20ac 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: add ovl_upper_mnt_userns() wrapper

Add a tiny wrapper to retrieve the upper mount's idmapping. Have it
return the initial idmapping until we have prepared and converted all
places to take the relevant idmapping into account. Then we can switch
on idmapped layer support by having ovl_upper_mnt_userns() return the
upper mount's idmapping.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 335d3fc5 07-Jan-2021 Sargun Dhillon <sargun@sargun.me>

ovl: implement volatile-specific fsync error behaviour

Overlayfs's volatile option allows the user to bypass all forced sync calls
to the upperdir filesystem. This comes at the cost of safety. We can never
ensure that the user's data is intact, but we can make a best effort to
expose whether or not the data is likely to be in a bad state.

The best way to handle this in the time being is that if an overlayfs's
upperdir experiences an error after a volatile mount occurs, that error
will be returned on fsync, fdatasync, sync, and syncfs. This is
contradictory to the traditional behaviour of VFS which fails the call
once, and only raises an error if a subsequent fsync error has occurred,
and been raised by the filesystem.

One awkward aspect of the patch is that we have to manually set the
superblock's errseq_t after the sync_fs callback as opposed to just
returning an error from syncfs. This is because the call chain looks
something like this:

sys_syncfs ->
sync_filesystem ->
__sync_filesystem ->
/* The return value is ignored here
sb->s_op->sync_fs(sb)
_sync_blockdev
/* Where the VFS fetches the error to raise to userspace */
errseq_check_and_advance

Because of this we call errseq_set every time the sync_fs callback occurs.
Due to the nature of this seen / unseen dichotomy, if the upperdir is an
inconsistent state at the initial mount time, overlayfs will refuse to
mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
will increment on error until the user calls syncfs.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: c86243b090bc ("ovl: provide a mount option "volatile"")
Cc: stable@vger.kernel.org
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2d2f2d73 14-Dec-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: user xattr

Optionally allow using "user.overlay." namespace instead of
"trusted.overlay."

This is necessary for overlayfs to be able to be mounted in an unprivileged
namepsace.

Make the option explicit, since it makes the filesystem format be
incompatible.

Disable redirect_dir and metacopy options, because these would allow
privilege escalation through direct manipulation of the
"user.overlay.redirect" or "user.overlay.metacopy" xattrs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>


# 5830fb6b 13-Oct-2020 Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

ovl: introduce new "uuid=off" option for inodes index feature

This replaces uuid with null in overlayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there is
only one filesystem for all the work/upper/lower directories and bare file
handles from this backing filesystem are unique. In other case when we have
multiple filesystems lets just fallback to "uuid=on" which is and
equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with the
original one. This way the "copy" container can't have the same uuid on the
superblock and mounting the overlayfs from it later would fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
#/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
#mount: /loop-test/loop-mp/merged:
#mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
create the copy of container and we require fs uuid to be unique for a new
container.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c86243b0 31-Aug-2020 Vivek Goyal <vgoyal@redhat.com>

ovl: provide a mount option "volatile"

Container folks are complaining that dnf/yum issues too many sync while
installing packages and this slows down the image build. Build requirement
is such that they don't care if a node goes down while build was still
going on. In that case, they will simply throw away unfinished layer and
start new build. So they don't care about syncing intermediate state to the
disk and hence don't want to pay the price associated with sync.

So they are asking for mount options where they can disable sync on overlay
mount point.

They primarily seem to have two use cases.

- For building images, they will mount overlay with nosync and then sync
upper layer after unmounting overlay and reuse upper as lower for next
layer.

- For running containers, they don't seem to care about syncing upper layer
because if node goes down, they will simply throw away upper layer and
create a fresh one.

So this patch provides a mount option "volatile" which disables all forms
of sync. Now it is caller's responsibility to throw away upper if system
crashes or shuts down and start fresh.

With "volatile", I am seeing roughly 20% speed up in my VM where I am just
installing emacs in an image. Installation time drops from 31 seconds to 25
seconds when nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take out the
network operations latency out of the measurement.

Giuseppe is also looking to cut down on number of iops done on the disk. He
is complaining that often in cloud their VMs are throttled if they cross
the limit. This option can help them where they reduce number of iops (by
cutting down on frequent sync and writebacks).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b8e42a65 04-Jun-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: get rid of redundant members in struct ovl_fs

ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
used instead of a separate ->upperdir_trap.

Split the lowerdir option early to get the number of layers, then allocate
the ->layers array, and finally fill the upper and lower layers, as before.

Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
take care of that.

[Colin Ian King] Fix null pointer dereference on null stack pointer on
error return found by Coverity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 08f4c7c8 04-Jun-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: add accessor for ofs->upper_mnt

Next patch will remove ofs->upper_mnt, so add an accessor function for this
field.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c21c839b 23-Apr-2020 Chengguang Xu <cgxu519@mykernel.net>

ovl: whiteout inode sharing

Share inode with different whiteout files for saving inode and speeding up
delete operation.

If EMLINK is encountered when linking a shared whiteout, create a new one.
In case of any other error, disable sharing for this super block.

Note: ofs->whiteout is protected by inode lock on workdir.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4d314f78 21-Feb-2020 Amir Goldstein <amir73il@gmail.com>

ovl: use a private non-persistent ino pool

There is no reason to deplete the system's global get_next_ino() pool for
overlay non-persistent inode numbers and there is no reason at all to
allocate non-persistent inode numbers for non-directories.

For non-directories, it is much better to leave i_ino the same as real
i_ino, to be consistent with st_ino/d_ino.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 13464165 24-Jan-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: layer is const

The ovl_layer struct is never modified except at initialization.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1b81dddd 16-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: fix corner case of conflicting lower layer uuid

This fixes ovl_lower_uuid_ok() to correctly detect the corner case:
- two filesystems, A and B, both have null uuid
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B

In this case, bad_uuid would not have been set for B, because the check
only involved the list of lower fs. Hence we'll try to decode a layer 2
origin on layer 1 and fail.

We check for conflicting (and null) uuid among all lower layers, including
those layers that are on the same fs as the upper layer.

Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 07f1e596 14-Jan-2020 Amir Goldstein <amir73il@gmail.com>

ovl: generalize the lower_fs[] array

Rename lower_fs[] array to fs[], extend its size by one and use index fsid
(instead of fsid-1) to access the fs[] array.

Initialize fs[0] with upper fs values. fsid 0 is reserved even with lower
only overlay, so fs[0] remains null in this case.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0f831ec8 16-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: simplify ovl_same_sb() helper

No code uses the sb returned from this helper, so make it retrun a boolean
and rename it to ovl_same_fs().

The xino mode is irrelevant when all layers are on same fs, so instead of
describing samefs with mode OVL_XINO_OFF, use a new xino_mode state, which
is 0 in the case of samefs, -1 in the case of xino=off and > 0 with xino
enabled.

Create a new helper ovl_same_dev(), to use instead of the common check for
(ovl_same_fs() || xinobits).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 94375f9d 15-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: generalize the lower_layers[] array

Rename lower_layers[] array to layers[], extend its size by one and
initialize layers[0] with upper layer values. Lower layers are now
addressed with index 1..numlower. layers[0] is reserved even with lower
only overlay.

[SzM: replace ofs->numlower with ofs->numlayer, the latter's value is
incremented by one]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7e63c87f 14-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: fix lookup failure on multi lower squashfs

In the past, overlayfs required that lower fs have non null uuid in
order to support nfs export and decode copy up origin file handles.

Commit 9df085f3c9a2 ("ovl: relax requirement for non null uuid of
lower fs") relaxed this requirement for nfs export support, as long
as uuid (even if null) is unique among all lower fs.

However, said commit unintentionally also relaxed the non null uuid
requirement for decoding copy up origin file handles, regardless of
the unique uuid requirement.

Amend this mistake by disabling decoding of copy up origin file handle
from lower fs with a conflicting uuid.

We still encode copy up origin file handles from those fs, because
file handles like those already exist in the wild and because they
might provide useful information in the future.

There is an unhandled corner case described by Miklos this way:
- two filesystems, A and B, both have null uuid
- upper layer is on A
- lower layer 1 is also on A
- lower layer 2 is on B

In this case bad_uuid won't be set for B, because the check only
involves the list of lower fs. Hence we'll try to decode a layer 2
origin on layer 1 and fail.

We will deal with this corner case later.

Reported-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/
Fixes: 9df085f3c9a2 ("ovl: relax requirement for non null uuid ...")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0be0bfd2 12-Jul-2019 Amir Goldstein <amir73il@gmail.com>

ovl: fix regression caused by overlapping layers detection

Once upon a time, commit 2cac0c00a6cd ("ovl: get exclusive ownership on
upper/work dirs") in v4.13 added some sanity checks on overlayfs layers.
This change caused a docker regression. The root cause was mount leaks
by docker, which as far as I know, still exist.

To mitigate the regression, commit 85fdee1eef1a ("ovl: fix regression
caused by exclusive upper/work dir protection") in v4.14 turned the
mount errors into warnings for the default index=off configuration.

Recently, commit 146d62e5a586 ("ovl: detect overlapping layers") in
v5.2, re-introduced exclusive upper/work dir checks regardless of
index=off configuration.

This changes the status quo and mount leak related bug reports have
started to re-surface. Restore the status quo to fix the regressions.
To clarify, index=off does NOT relax overlapping layers check for this
ovelayfs mount. index=off only relaxes exclusive upper/work dir checks
with another overlayfs mount.

To cover the part of overlapping layers detection that used the
exclusive upper/work dir checks to detect overlap with self upper/work
dir, add a trap also on the work base dir.

Link: https://github.com/moby/moby/issues/34672
Link: https://lore.kernel.org/linux-fsdevel/20171006121405.GA32700@veci.piliscsaba.szeredi.hu/
Link: https://github.com/containers/libpod/issues/3540
Fixes: 146d62e5a586 ("ovl: detect overlapping layers")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Colin Walters <walters@verbum.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 146d62e5 18-Apr-2019 Amir Goldstein <amir73il@gmail.com>

ovl: detect overlapping layers

Overlapping overlay layers are not supported and can cause unexpected
behavior, but overlayfs does not currently check or warn about these
configurations.

User is not supposed to specify the same directory for upper and
lower dirs or for different lower layers and user is not supposed to
specify directories that are descendants of each other for overlay
layers, but that is exactly what this zysbot repro did:

https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000

Moving layer root directories into other layers while overlayfs
is mounted could also result in unexpected behavior.

This commit places "traps" in the overlay inode hash table.
Those traps are dummy overlay inodes that are hashed by the layers
root inodes.

On mount, the hash table trap entries are used to verify that overlay
layers are not overlapping. While at it, we also verify that overlay
layers are not overlapping with directories "in-use" by other overlay
instances as upperdir/workdir.

On lookup, the trap entries are used to verify that overlay layers
root inodes have not been moved into other layers after mount.

Some examples:

$ ./run --ov --samefs -s
...
( mkdir -p base/upper/0/u base/upper/0/w base/lower lower upper mnt
mount -o bind base/lower lower
mount -o bind base/upper upper
mount -t overlay none mnt ...
-o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w)

$ umount mnt
$ mount -t overlay none mnt ...
-o lowerdir=base,upperdir=upper/0/u,workdir=upper/0/w

[ 94.434900] overlayfs: overlapping upperdir path
mount: mount overlay on mnt failed: Too many levels of symbolic links

$ mount -t overlay none mnt ...
-o lowerdir=upper/0/u,upperdir=upper/0/u,workdir=upper/0/w

[ 151.350132] overlayfs: conflicting lowerdir path
mount: none is already mounted or mnt busy

$ mount -t overlay none mnt ...
-o lowerdir=lower:lower/a,upperdir=upper/0/u,workdir=upper/0/w

[ 201.205045] overlayfs: overlapping lowerdir path
mount: mount overlay on mnt failed: Too many levels of symbolic links

$ mount -t overlay none mnt ...
-o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w
$ mv base/upper/0/ base/lower/
$ find mnt/0
mnt/0
mnt/0/w
find: 'mnt/0/w/work': Too many levels of symbolic links
find: 'mnt/0/u': Too many levels of symbolic links

Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2664bd08 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Store lower data inode in ovl_inode

Right now ovl_inode stores inode pointer for lower inode. This helps with
quickly getting lower inode given overlay inode (ovl_inode_lower()).

Now with metadata only copy-up, we can have metacopy inode in middle layer
as well and inode containing data can be different from ->lower. I need to
be able to open the real file in ovl_open_realfile() and for that I need to
quickly find the lower data inode.

Hence store lower data inode also in ovl_inode. Also provide an helper
ovl_inode_lowerdata() to access this field.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d5791044 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Provide a mount option metacopy=on/off for metadata copyup

By default metadata only copy up is disabled. Provide a mount option so
that users can choose one way or other.

Also provide a kernel config and module option to enable/disable metacopy
feature.

metacopy feature requires redirect_dir=on when upper is present.
Otherwise, it requires redirect_dir=follow atleast.

As of now, metacopy does not work with nfs_export=on. So if both
metacopy=on and nfs_export=on then nfs_export is disabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 795939a9 29-Mar-2018 Amir Goldstein <amir73il@gmail.com>

ovl: add support for "xino" mount and config options

With mount option "xino=on", mounter declares that there are enough
free high bits in underlying fs to hold the layer fsid.
If overlayfs does encounter underlying inodes using the high xino
bits reserved for layer fsid, a warning will be emitted and the original
inode number will be used.

The mount option name "xino" goes after a similar meaning mount option
of aufs, but in overlayfs case, the mapping is stateless.

An example for a use case of "xino=on" is when upper/lower is on an xfs
filesystem. xfs uses 64bit inode numbers, but it currently never uses the
upper 8bit for inode numbers exposed via stat(2) and that is not likely to
change in the future without user opting-in for a new xfs feature. The
actual number of unused upper bit is much larger and determined by the xfs
filesystem geometry (64 - agno_log - agblklog - inopblog). That means
that for all practical purpose, there are enough unused bits in xfs
inode numbers for more than OVL_MAX_STACK unique fsid's.

Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode
numbers are allocated sequentially since boot, so they will practially
never use the high inode number bits.

For compatibility with applications that expect 32bit inodes, the feature
can be disabled with "xino=off". The option "xino=auto" automatically
detects underlying filesystem that use 32bit inodes and enables the
feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of
the same name, determine if the default mode for overlayfs mount is
"xino=auto" or "xino=off".

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# e487d889 07-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: constant st_ino for non-samefs with xino

On 64bit systems, when overlay layers are not all on the same fs, but
all inode numbers of underlying fs are not using the high bits, use the
high bits to partition the overlay st_ino address space. The high bits
hold the fsid (upper fsid is 0). This way overlay inode numbers are unique
and all inodes use overlay st_dev. Inode numbers are also persistent
for a given layer configuration.

Currently, our only indication for available high ino bits is from a
filesystem that supports file handles and uses the default encode_fh()
operation, which encodes a 32bit inode number.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5148626b 28-Mar-2018 Amir Goldstein <amir73il@gmail.com>

ovl: allocate anon bdev per unique lower fs

Instead of allocating an anonymous bdev per lower layer, allocate
one anonymous bdev per every unique lower fs that is different than
upper fs.

Every unique lower fs is assigned an fsid > 0 and the number of
unique lower fs are stored in ofs->numlowerfs.

The assigned fsid is stored in the lower layer struct and will be
used also for inode number multiplexing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c62520a8 14-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: store 'has_upper' and 'opaque' as bit flags

We need to make some room in struct ovl_entry to store information
about redirected ancestors for NFS export, so cram two booleans as
bit flags.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# f168f109 19-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: add support for "nfs_export" configuration

Introduce the "nfs_export" config, module and mount options.

The NFS export feature depends on the "index" feature and enables two
implicit overlayfs features: "index_all" and "verify_lower".
The "index_all" feature creates an index on copy up of every file and
directory. The "verify_lower" feature uses the full index to detect
overlay filesystems inconsistencies on lookup, like redirect from
multiple upper dirs to the same lower dir.

NFS export can be enabled for non-upper mount with no index. However,
because lower layer redirects cannot be verified with the index, enabling
NFS export support on an overlay with no upper layer requires turning off
redirect follow (e.g. "redirect_dir=nofollow").

The full index may incur some overhead on mount time, especially when
verifying that lower directory file handles are not stale.

NFS export support, full index and consistency verification will be
implemented by following patches.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d583ed7d 08-Nov-2017 Amir Goldstein <amir73il@gmail.com>

ovl: store layer index in ovl_layer

Store the fs root layer index inside ovl_layer struct, so we can
get the root fs layer index from merge dir lower layer instead of
find it with ovl_find_layer() helper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3382290e 24-Oct-2017 Will Deacon <will@kernel.org>

locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

[ Note, this is a Git cherry-pick of the following commit:

506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

... for easier x86 PTI code testing and back-porting. ]

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 438c84c2 11-Dec-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't follow redirects if redirect_dir=off

Overlayfs is following redirects even when redirects are disabled. If this
is unintentional (probably the majority of cases) then this can be a
problem. E.g. upper layer comes from untrusted USB drive, and attacker
crafts a redirect to enable read access to otherwise unreadable
directories.

If "redirect_dir=off", then turn off following as well as creation of
redirects. If "redirect_dir=follow", then turn on following, but turn off
creation of redirects (which is what "redirect_dir=off" does now).

This is a backward incompatible change, so make it dependent on a config
option.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2a9c6d06 01-Nov-2017 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ovl: allocate anonymous devs for lowerdirs

Generate unique values of st_dev per lower layer for non-samefs
overlay mount. The unique values are obtained by allocating anonymous
bdevs for each of the lowerdirs in the overlayfs instance.

The anonymous bdev is going to be returned by stat(2) for lowerdir
non-dir entries in non-samefs case.

[amir: split from ovl_getattr() and re-structure patches]

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b9343632 24-Jul-2017 Chandan Rajendra <chandan@linux.vnet.ibm.com>

ovl: re-structure overlay lower layers in-memory

Define new structures to represent overlay instance lower layers and
overlay merge dir lower layers to make room for storing more per layer
information in-memory.

Instead of keeping the fs instance lower layers in an array of struct
vfsmount, keep them in an array of new struct ovl_layer, that has a
pointer to struct vfsmount.

Instead of keeping the dentry lower layers in an array of struct path,
keep them in an array of new struct ovl_path, that has a pointer to
struct dentry and to struct ovl_layer.

Add a small helper to find the fs layer id that correspopnds to a lower
struct ovl_path and use it in ovl_lookup().

[amir: split re-structure from anonymous bdev patch]

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 506458ef 24-Oct-2017 Will Deacon <will@kernel.org>

locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 85fdee1e 29-Sep-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix regression caused by exclusive upper/work dir protection

Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.

Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:

Terminal 1:

mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/

Terminal 2:

unshare -m

Terminal 1:

umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy

To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.

Documentation was updated to reflect this change.

Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 55acc661 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: add flag for upper in ovl_entry

For rename, we need to ensure that an upper alias exists for hard links
before attempting the operation. Introduce a flag in ovl_entry to track
the state of the upper alias.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 02bcd157 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: introduce the inodes index dir feature

Create the index dir on mount. The index dir will contain hardlinks to
upper inodes, named after the hex representation of their origin lower
inodes.

The index dir is going to be used to prevent breaking lower hardlinks
on copy up and to implement overlayfs NFS export.

Because the feature is not fully backward compat, enabling the feature
is opt-in by config/module/mount option.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2cac0c00 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: get exclusive ownership on upper/work dirs

Bad things can happen if several concurrent overlay mounts try to
use the same upperdir/workdir path.

Try to get the 'inuse' advisory lock on upperdir and workdir.
Fail mount if another overlay mount instance or another user
holds the 'inuse' lock on these directories.

Note that this provides no protection for concurrent overlay
mount that use overlapping (i.e. descendant) upper/work dirs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 04a01ac7 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move cache and version to ovl_inode

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a015dafc 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: use ovl_inode mutex to synchronize concurrent copy up

Use the new ovl_inode mutex to synchonize concurrent copy up
instead of the super block copy up workqueue.

Moving the synchronization object from the overlay dentry to
the overlay inode is needed for synchonizing concurrent copy up
of lower hardlinks to the same upper inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 13c72075 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move impure to ovl_inode

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# cf31c463 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move redirect to ovl_inode

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 09d8b586 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move __upperdentry to ovl_inode

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 25b7713a 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: use i_private only as a key

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 13cf199d 12-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: allocate an ovl_inode struct

We need some more space to store overlay inode data in memory,
so allocate overlay inodes from a slab of struct ovl_inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ee1d6d37 11-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: mark upper dir with type origin entries "impure"

When moving a merge dir or non-dir with copy up origin into a non-merge
upper dir (a.k.a pure upper dir), we are marking the target parent dir
"impure". ovl_iterate() iterates pure upper dirs directly, because there is
no need to filter out whiteouts and merge dir content with lower dir. But
for the case of an "impure" upper dir, ovl_iterate() will not be able to
iterate the real upper dir directly, because it will need to lookup the
origin inode and use it to fill d_ino.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 82b749b2 16-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: check on mount time if upper fs supports setting xattr

xattr are needed by overlayfs for setting opaque dir, redirect dir
and copy up origin.

Check at mount time by trying to set the overlay.opaque xattr on the
workdir and if that fails issue a warning message.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7bcd74b9 22-Mar-2017 Amir Goldstein <amir73il@gmail.com>

ovl: check if all layers are on the same fs

Some features can only work when all layers are on the same fs. Test this
condition during mount time, so features can check them later.

Add helper ovl_same_sb() to return the common super block in case all
layers are on the same fs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 39d3d60a 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

ovl: introduce copy up waitqueue

The overlay sb 'copyup_wq' and overlay inode 'copying' condition
variable are about to replace the upper sb rename_lock, as finer
grained synchronization objects for concurrent copy up.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# e7f52429 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

ovl: check if upperdir fs supports O_TMPFILE

This is needed for choosing between concurrent copyup
using O_TMPFILE and legacy copyup using workdir+rename.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a6c60655 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: redirect on rename-dir

Current code returns EXDEV when a directory would need to be copied up to
move. We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.

This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay. After such attribute has been set,
the directory can be moved without further actions required.

This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 02b69b28 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: lookup redirects

If a directory has the "trusted.overlay.redirect" xattr, it means that the
value of the xattr should be used to find the underlying directory on the
next lower layer.

The redirect may be relative or absolute. Absolute redirects begin with a
slash.

A relative redirect means: instead of the current dentry's name use the
value of the redirect to find the directory in the next lower
layer. Relative redirects must not contain a slash.

An absolute redirect means: look up the directory relative to the root of
the overlay using the value of the redirect in the next lower layer.

Redirects work on lower layers as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 6b2d5fe4 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: check namelen

We already calculate f_namelen in statfs as the maximum of the name lengths
provided by the filesystems taking part in the overlay.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# bbb1e54d 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: split super.c

fs/overlayfs/super.c is the biggest of the overlayfs source files and it
contains various utility functions as well as the rather complicated lookup
code. Split these parts out to separate files.

Before:

1446 fs/overlayfs/super.c

After:

919 fs/overlayfs/super.c
267 fs/overlayfs/namei.c
235 fs/overlayfs/util.c
51 fs/overlayfs/ovl_entry.h

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>