History log of /linux-master/fs/overlayfs/copy_up.c
Revision Date Author Comments
# 77a28aa4 17-Mar-2024 Amir Goldstein <amir73il@gmail.com>

ovl: relax WARN_ON in ovl_verify_area()

syzbot hit an assertion in copy up data loop which looks like it is
the result of a lower file whose size is being changed underneath
overlayfs.

This type of use case is documented to cause undefined behavior, so
returning EIO error for the copy up makes sense, but it should not be
causing a WARN_ON assertion.

Reported-and-tested-by: syzbot+3abd99031b42acf367ef@syzkaller.appspotmail.com
Fixes: ca7ab482401c ("ovl: add permission hooks outside of do_splice_direct()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 853b8d75 01-Feb-2024 Amir Goldstein <amir73il@gmail.com>

remap_range: merge do_clone_file_range() into vfs_clone_file_range()

commit dfad37051ade ("remap_range: move permission hooks out of
do_clone_file_range()") moved the permission hooks from
do_clone_file_range() out to its caller vfs_clone_file_range(),
but left all the fast sanity checks in do_clone_file_range().

This makes the expensive security hooks be called in situations
that they would not have been called before (e.g. fs does not support
clone).

The only reason for the do_clone_file_range() helper was that overlayfs
did not use to be able to call vfs_clone_file_range() from copy up
context with sb_writers lock held. However, since commit c63e56a4a652
("ovl: do not open/llseek lower file with upper sb_writers held"),
overlayfs just uses an open coded version of vfs_clone_file_range().

Merge_clone_file_range() into vfs_clone_file_range(), restoring the
original order of checks as it was before the regressing commit and adapt
the overlayfs code to call vfs_clone_file_range() before the permission
hooks that were added by commit ca7ab482401c ("ovl: add permission hooks
outside of do_splice_direct()").

Note that in the merge of do_clone_file_range(), the file_start_write()
context was reduced to cover ->remap_file_range() without holding it
over the permission hooks, which was the reason for doing the regressing
commit in the first place.

Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401312229.eddeb9a6-oliver.sang@intel.com
Fixes: dfad37051ade ("remap_range: move permission hooks out of do_clone_file_range()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202102258.1582671-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# a8b00268 20-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

rename(): avoid a deadlock in the case of parents having no common ancestor

... and fix the directory locking documentation and proof of correctness.
Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the
case where we really don't want it is splicing the root of disconnected
tree to somewhere.

In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an
ancestor of Y" only if X and Y are already in the same tree. Otherwise
it can go from false to true, and one can construct a deadlock on that.

Make lock_two_directories() report an error in such case and update the
callers of lock_rename()/lock_rename_child() to handle such errors.

And yes, such conditions are not impossible to create ;-/

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2c3ef4f8 20-Nov-2023 Amir Goldstein <amir73il@gmail.com>

ovl: initialize ovl_copy_up_ctx.destname inside ovl_do_copy_up()

The ->destname member of struct ovl_copy_up_ctx is initialized inside
ovl_copy_up_one() to ->d_name of the overlayfs dentry being copied up
and then it may be overridden by index name inside ovl_do_copy_up().

ovl_inode_lock() in ovl_copy_up_start() and ovl_copy_up() in ovl_rename()
effectively stabilze ->d_name of the overlayfs dentry being copied up,
but ovl_inode_lock() is not held when ->d_name is being read.

It is not a correctness bug, because if ovl_do_copy_up() races with
ovl_rename() and ctx.destname is freed, we will not end up calling
ovl_do_copy_up() with the dead name reference.

The code becomes much easier to understand and to document if the
initialization of c->destname is always done inside ovl_do_copy_up(),
either to the index entry name, or to the overlay dentry ->d_name.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 0f292086 12-Dec-2023 Amir Goldstein <amir73il@gmail.com>

splice: return type ssize_t from all helpers

Not sure why some splice helpers return long, maybe historic reasons.
Change them all to return ssize_t to conform to the splice methods and
to the rest of the helpers.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# da40448c 30-Nov-2023 Amir Goldstein <amir73il@gmail.com>

fs: move file_start_write() into direct_splice_actor()

The callers of do_splice_direct() hold file_start_write() on the output
file.

This may cause file permission hooks to be called indirectly on an
overlayfs lower layer, which is on the same filesystem of the output
file and could lead to deadlock with fanotify permission events.

To fix this potential deadlock, move file_start_write() from the callers
into the direct_splice_actor(), so file_start_write() will not be held
while splicing from the input file.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231128214258.GA2398475@perftesting/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# ca7ab482 22-Nov-2023 Amir Goldstein <amir73il@gmail.com>

ovl: add permission hooks outside of do_splice_direct()

The main callers of do_splice_direct() also call rw_verify_area() for
the entire range that is being copied, e.g. by vfs_copy_file_range()
or do_sendfile() before calling do_splice_direct().

The only caller that does not have those checks for entire range is
ovl_copy_up_file(). In preparation for removing the checks inside
do_splice_direct(), add rw_verify_area() call in ovl_copy_up_file().

For extra safety, perform minimal sanity checks from rw_verify_area()
for non negative offsets also in the copy up do_splice_direct() loop
without calling the file permission hooks.

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-2-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 413ba910 17-Dec-2023 Amir Goldstein <amir73il@gmail.com>

ovl: fix dentry reference leak after changes to underlying layers

syzbot excercised the forbidden practice of moving the workdir under
lowerdir while overlayfs is mounted and tripped a dentry reference leak.

Fixes: c63e56a4a652 ("ovl: do not open/llseek lower file with upper sb_writers held")
Reported-and-tested-by: syzbot+8608bb4553edb8c78f41@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 5b02bfc1 16-Aug-2023 Amir Goldstein <amir73il@gmail.com>

ovl: do not encode lower fh with upper sb_writers held

When lower fs is a nested overlayfs, calling encode_fh() on a lower
directory dentry may trigger copy up and take sb_writers on the upper fs
of the lower nested overlayfs.

The lower nested overlayfs may have the same upper fs as this overlayfs,
so nested sb_writers lock is illegal.

Move all the callers that encode lower fh to before ovl_want_write().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# c63e56a4 15-Aug-2023 Amir Goldstein <amir73il@gmail.com>

ovl: do not open/llseek lower file with upper sb_writers held

overlayfs file open (ovl_maybe_lookup_lowerdata) and overlay file llseek
take the ovl_inode_lock, without holding upper sb_writers.

In case of nested lower overlay that uses same upper fs as this overlay,
lockdep will warn about (possibly false positive) circular lock
dependency when doing open/llseek of lower ovl file during copy up with
our upper sb_writers held, because the locking ordering seems reverse to
the locking order in ovl_copy_up_start():

- lower ovl_inode_lock
- upper sb_writers

Let the copy up "transaction" keeps an elevated mnt write count on upper
mnt, but leaves taking upper sb_writers to lower level helpers only when
they actually need it. This allows to avoid holding upper sb_writers
during lower file open/llseek and prevents the lockdep warning.

Minimizing the scope of upper sb_writers during copy up is also needed
for fixing another possible deadlocks by a following patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 162d0644 19-Jul-2023 Amir Goldstein <amir73il@gmail.com>

ovl: reorder ovl_want_write() after ovl_inode_lock()

Make the locking order of ovl_inode_lock() strictly between the two
vfs stacked layers, i.e.:
- ovl vfs locks: sb_writers, inode_lock, ...
- ovl_inode_lock
- upper vfs locks: sb_writers, inode_lock, ...

To that effect, move ovl_want_write() into the helpers ovl_nlink_start()
and ovl_copy_up_start which currently take the ovl_inode_lock() after
ovl_want_write().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 03dbab3b 13-Sep-2023 Jeff Layton <jlayton@kernel.org>

overlayfs: set ctime when setting mtime and atime

Nathan reported that he was seeing the new warning in
setattr_copy_mgtime pop when starting podman containers. Overlayfs is
trying to set the atime and mtime via notify_change without also
setting the ctime.

POSIX states that when the atime and mtime are updated via utimes() that
we must also update the ctime to the current time. The situation with
overlayfs copy-up is analogies, so add ATTR_CTIME to the bitmask.
notify_change will fill in the value.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230913-ctime-v1-1-c6bc509cbc27@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# ab048302 04-Sep-2023 Amir Goldstein <amir73il@gmail.com>

ovl: fix failed copyup of fileattr on a symlink

Some local filesystems support setting persistent fileattr flags
(e.g. FS_NOATIME_FL) on directories and regular files via ioctl.
Some of those persistent fileattr flags are reflected to vfs as
in-memory inode flags (e.g. S_NOATIME).

Overlayfs uses the in-memory inode flags (e.g. S_NOATIME) on a lower file
as an indication that a the lower file may have persistent inode fileattr
flags (e.g. FS_NOATIME_FL) that need to be copied to upper file.

However, in some cases, the S_NOATIME in-memory flag could be a false
indication for persistent FS_NOATIME_FL fileattr. For example, with NFS
and FUSE lower fs, as was the case in the two bug reports, the S_NOATIME
flag is set unconditionally for all inodes.

Users cannot set persistent fileattr flags on symlinks and special files,
but in some local fs, such as ext4/btrfs/tmpfs, the FS_NOATIME_FL fileattr
flag are inheritted to symlinks and special files from parent directory.

In both cases described above, when lower symlink has the S_NOATIME flag,
overlayfs will try to copy the symlink's fileattrs and fail with error
ENOXIO, because it could not open the symlink for the ioctl security hook.

To solve this failure, do not attempt to copyup fileattrs for anything
other than directories and regular files.

Reported-by: Ruiwen Zhao <ruiwen@google.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217850
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# f01d0889 21-May-2023 Andrea Righi <andrea.righi@canonical.com>

ovl: make consistent use of OVL_FS()

Always use OVL_FS() to retrieve the corresponding struct ovl_fs from a
struct super_block.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# b0504bfe 26-Jun-2023 Amir Goldstein <amir73il@gmail.com>

ovl: add support for unique fsid per instance

The legacy behavior of ovl_statfs() reports the f_fsid filled by
underlying upper fs. This fsid is not unique among overlayfs instances
on the same upper fs.

With mount option uuid=on, generate a non-persistent uuid per overlayfs
instance and use it as the seed for f_fsid, similar to tmpfs.

This is useful for reporting fanotify events with fid info from different
instances of overlayfs over the same upper fs.

The old behavior of null uuid and upper fs fsid is retained with the
mount option uuid=null, which is the default.

The mount option uuid=off that disables uuid checks in underlying layers
also retains the legacy behavior.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 0c71faf5 19-Apr-2023 Alexander Larsson <alexl@redhat.com>

ovl: Handle verity during copy-up

During regular metacopy, if lowerdata file has fs-verity enabled, and
the verity option is enabled, we add the digest to the metacopy xattr.

If verity is required, and lowerdata does not have fs-verity enabled,
fall back to full copy-up (or the generated metacopy would not
validate).

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 184996e9 21-Jun-2023 Alexander Larsson <alexl@redhat.com>

ovl: Validate verity xattr when resolving lowerdata

The new digest field in the metacopy xattr is used during lookup to
record whether the header contained a digest in the OVL_HAS_DIGEST
flags.

When accessing file data the first time, if OVL_HAS_DIGEST is set, we
reload the metadata and check that the source lowerdata inode matches
the specified digest in it (according to the enabled verity
options). If the verity check passes we store this info in the inode
flags as OVL_VERIFIED_DIGEST, so that we can avoid doing it again if
the inode remains in memory.

The verification is done in ovl_maybe_validate_verity() which needs to
be called in the same places as ovl_maybe_lookup_lowerdata(), so there
is a new ovl_verify_lowerdata() helper that calls these in the right
order, and all current callers of ovl_maybe_lookup_lowerdata() are
changed to call it instead.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 42dd69ae 27-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: implement lazy lookup of lowerdata in data-only layers

Defer lookup of lowerdata in the data-only layers to first data access
or before copy up.

We perform lowerdata lookup before copy up even if copy up is metadata
only copy up. We can further optimize this lookup later if needed.

We do best effort lazy lookup of lowerdata for d_real_inode(), because
this interface does not expect errors. The only current in-tree caller
of d_real_inode() is trace_uprobe and this caller is likely going to be
followed reading from the file, before placing uprobes on offset within
the file, so lowerdata should be available when setting the uprobe.

Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b07d5cc9 03-Apr-2023 Amir Goldstein <amir73il@gmail.com>

ovl: update of dentry revalidate flags after copy up

After copy up, we may need to update d_flags if upper dentry is on a
remote fs and lower dentries are not.

Add helpers to allow incremental update of the revalidate flags.

Fixes: bccece1ead36 ("ovl: allow remote upper")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a1fbb607 01-Feb-2023 Christian Brauner <brauner@kernel.org>

ovl: check for ->listxattr() support

We have decoupled vfs_listxattr() from IOP_XATTR. Instead we just need
to check whether inode->i_op->listxattr is implemented.

Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 4f11ada1 24-Jan-2023 Miklos Szeredi <mszeredi@redhat.com>

ovl: fail on invalid uid/gid mapping at copy up

If st_uid/st_gid doesn't have a mapping in the mounter's user_ns, then
copy-up should fail, just like it would fail if the mounter task was doing
the copy using "cp -a".

There's a corner case where the "cp -a" would succeed but copy up fail: if
there's a mapping of the invalid uid/gid (65534 by default) in the user
namespace. This is because stat(2) will return this value if the mapping
doesn't exist in the current user_ns and "cp -a" will in turn be able to
create a file with this uid/gid.

This behavior would be inconsistent with POSIX ACL's, which return -1 for
invalid uid/gid which result in a failed copy.

For consistency and simplicity fail the copy of the st_uid/st_gid are
invalid.

Fixes: 459c7c565ac3 ("ovl: unprivieged mounts")
Cc: <stable@vger.kernel.org> # v5.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Seth Forshee <sforshee@kernel.org>


# baabaa50 24-Jan-2023 Miklos Szeredi <mszeredi@redhat.com>

ovl: fix tmpfile leak

Missed an error cleanup.

Reported-by: syzbot+fd749a7ea127a84e0ffd@syzkaller.appspotmail.com
Fixes: 2b1a77461f16 ("ovl: use vfs_tmpfile_open() helper")
Cc: <stable@vger.kernel.org> # v6.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 31acceb9 22-Sep-2022 Christian Brauner <brauner@kernel.org>

ovl: use posix acl api

Now that posix acls have a proper api us it to copy them.

All filesystems that can serve as lower or upper layers for overlayfs
have gained support for the new posix acl api in previous patches.
So switch all internal overlayfs codepaths for copying posix acls to the
new posix acl api.

Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 2b1a7746 23-Sep-2022 Miklos Szeredi <mszeredi@redhat.com>

ovl: use vfs_tmpfile_open() helper

If tmpfile is used for copy up, then use this helper to create the tmpfile
and open it at the same time. This will later allow filesystems such as
fuse to do this operation atomically.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2d343087 04-Aug-2022 Al Viro <viro@zeniv.linux.org.uk>

overlayfs: constify path

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4e3299ea 29-Jun-2022 Jason A. Donenfeld <Jason@zx2c4.com>

fs: do not compare against ->llseek

Now vfs_llseek() can simply check for FMODE_LSEEK; if it's set,
we know that ->llseek() won't be NULL and if it's not we should
just fail with -ESPIPE.

A couple of other places where we used to check for special
values of ->llseek() (somewhat inconsistently) switched to
checking FMODE_LSEEK.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b27c82e1 21-Jun-2022 Christian Brauner <brauner@kernel.org>

attr: port attribute changes to new types

Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.

This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.

Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.

The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.

We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.

Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.

The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.

Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# dad7017a 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: use ovl_path_getxattr() wrapper

Add a helper that allows to retrieve ovl xattrs from either lower or
upper layers. To stop passing mnt and dentry separately everywhere use
struct path which more accurately reflects the tight coupling between
mount and dentry in this helper. Swich over all places to pass a path
argument that can operate on either upper or lower layers. This is
needed to support idmapped base layers with overlayfs.

Some helpers are always called with an upper dentry, which is now utilized
by these helpers to create the path. Make this usage explicit by renaming
the argument to "upperdentry" and by renaming the function as well in some
cases. Also add a check in ovl_do_getxattr() to catch misuse of these
functions.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 22f289ce 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: use ovl_lookup_upper() wrapper

Introduce ovl_lookup_upper() as a simple wrapper around lookup_one().
Make it clear in the helper's name that this only operates on the upper
layer. The wrapper will take upper layer's idmapping into account when
checking permission in lookup_one().

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a15506ea 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: use ovl_do_notify_change() wrapper

Introduce ovl_do_notify_change() as a simple wrapper around
notify_change() to support idmapped layers. The helper mirrors other
ovl_do_*() helpers that operate on the upper layers.

When changing ownership of an upper object the intended ownership needs
to be mapped according to the upper layer's idmapping. This mapping is
the inverse to the mapping applied when copying inode information from
an upper layer to the corresponding overlay inode. So e.g., when an
upper mount maps files that are stored on-disk as owned by id 1001 to
1000 this means that calling stat on this object from an idmapped mount
will report the file as being owned by id 1000. Consequently in order to
change ownership of an object in this filesystem so it appears as being
owned by id 1000 in the upper idmapped layer it needs to store id 1001
on disk. The mnt mapping helpers take care of this.

All idmapping helpers are nops when no idmapped base layers are used.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5272eaf3 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: pass ofs to setattr operations

Pass down struct ovl_fs to setattr operations so we can ultimately
retrieve the relevant upper mount and take the mount's idmapping into
account when creating new filesystem objects. This is needed to support
idmapped base layers with overlay.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 576bb263 03-Apr-2022 Christian Brauner <brauner@kernel.org>

ovl: pass ofs to creation operations

Pass down struct ovl_fs to all creation helpers so we can ultimately
retrieve the relevant upper mount and take the mount's idmapping into
account when creating new filesystem objects. This is needed to support
idmapped base layers with overlay.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c914c0e2 03-Apr-2022 Amir Goldstein <amir73il@gmail.com>

ovl: use wrappers to all vfs_*xattr() calls

Use helpers ovl_*xattr() to access user/trusted.overlay.* xattrs
and use helpers ovl_do_*xattr() to access generic xattrs. This is a
preparatory patch for using idmapped base layers with overlay.

Note that a few of those places called vfs_*xattr() calls directly to
reduce the amount of debug output. But as Miklos pointed out since
overlayfs has been stable for quite some time the debug output isn't all
that relevant anymore and the additional debug in all locations was
actually quite helpful when developing this patch series.

Cc: <linux-unionfs@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 94fd1975 14-Jan-2022 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't fail copy up if no fileattr support on upper

Christoph Fritz is reporting that failure to copy up fileattr when upper
doesn't support fileattr or xattr results in a regression.

Return success in these failure cases; this reverts overlayfs to the old
behavior.

Add a pr_warn_once() in these cases to still let the user know about the
copy up failures.

Reported-by: Christoph Fritz <chf.fritz@googlemail.com>
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4ee7e4a6 12-Jan-2022 Christoph Fritz <chf.fritz@googlemail.com>

ovl: fix NULL pointer dereference in copy up warning

This patch is fixing a NULL pointer dereference to get a recently
introduced warning message working.

Fixes: 5b0a414d06c3 ("ovl: fix filattr copy-up failure")
Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5b0a414d 04-Nov-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: fix filattr copy-up failure

This regression can be reproduced with ntfs-3g and overlayfs:

mkdir lower upper work overlay
dd if=/dev/zero of=ntfs.raw bs=1M count=2
mkntfs -F ntfs.raw
mount ntfs.raw lower
touch lower/file.txt
mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work - overlay
mv overlay/file.txt overlay/file2.txt

mv fails and (misleadingly) prints

mv: cannot move 'overlay/file.txt' to a subdirectory of itself, 'overlay/file2.txt'

The reason is that ovl_copy_fileattr() is triggered due to S_NOATIME being
set on all inodes (by fuse) regardless of fileattr.

ovl_copy_fileattr() tries to retrieve file attributes from lower file, but
that fails because filesystem does not support this ioctl (this should fail
with ENOTTY, but ntfs-3g return EINVAL instead). This failure is
propagated to origial operation (in this case rename) that triggered the
copy-up.

The fix is to ignore ENOTTY and EINVAL errors from fileattr_get() in copy
up. This also requires turning the internal ENOIOCTLCMD into ENOTTY.

As a further measure to prevent unnecessary failures, only try the
fileattr_get/set on upper if there are any flags to copy up.

Side note: a number of filesystems set S_NOATIME (and sometimes other inode
flags) irrespective of fileattr flags. This causes unnecessary calls
during copy up, which might lead to a performance issue, especially if
latency is high. To fix this, the kernel would need to differentiate
between the two cases. E.g. introduce SB_NOATIME_UPDATE, a per-sb variant
of S_NOATIME. SB_NOATIME doesn't work, because that's interpreted as
"filesystem doesn't store an atime attribute"

Reported-and-tested-by: Kevin Locke <kevin@kevinlocke.name>
Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# f945ca19 22-Jul-2021 Miklos Szeredi <mszeredi@redhat.com>

ovl: use kvalloc in xattr copy-up

Extended attributes are usually small, but could be up to 64k in size, so
use the most efficient method for doing the allocation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 096a218a 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: consistent behavior for immutable/append-only inodes

When a lower file has immutable/append-only fileattr flags, the behavior of
overlayfs post copy up is inconsistent.

Immediattely after copy up, ovl inode still has the S_IMMUTABLE/S_APPEND
inode flags copied from lower inode, so vfs code still treats the ovl inode
as immutable/append-only. After ovl inode evict or mount cycle, the ovl
inode does not have these inode flags anymore.

We cannot copy up the immutable and append-only fileattr flags, because
immutable/append-only inodes cannot be linked and because overlayfs will
not be able to set overlay.* xattr on the upper inodes.

Instead, if any of the fileattr flags of interest exist on the lower inode,
we store them in overlay.protattr xattr on the upper inode and we read the
flags from xattr on lookup and on fileattr_get().

This gives consistent behavior post copy up regardless of inode eviction
from cache.

When user sets new fileattr flags, we update or remove the overlay.protattr
xattr.

Storing immutable/append-only fileattr flags in an xattr instead of upper
fileattr also solves other non-standard behavior issues - overlayfs can now
copy up children of "ovl-immutable" directories and lower aliases of
"ovl-immutable" hardlinks.

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Link: https://lore.kernel.org/linux-unionfs/20201226104618.239739-1-cgxu519@mykernel.net/
Link: https://lore.kernel.org/linux-unionfs/20210210190334.1212210-5-amir73il@gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 72db8211 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: copy up sync/noatime fileattr flags

When a lower file has sync/noatime fileattr flags, the behavior of
overlayfs post copy up is inconsistent.

Immediately after copy up, ovl inode still has the S_SYNC/S_NOATIME
inode flags copied from lower inode, so vfs code still treats the ovl
inode as sync/noatime. After ovl inode evict or mount cycle,
the ovl inode does not have these inode flags anymore.

To fix this inconsistency, try to copy the fileattr flags on copy up
if the upper fs supports the fileattr_set() method.

This gives consistent behavior post copy up regardless of inode eviction
from cache.

We cannot copy up the immutable/append-only inode flags in a similar
manner, because immutable/append-only inodes cannot be linked and because
overlayfs will not be able to set overlay.* xattr on the upper inodes.

Those flags will be addressed by a followup patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a0c236b1 18-Jun-2021 Amir Goldstein <amir73il@gmail.com>

ovl: pass ovl_fs to ovl_check_setxattr()

Instead of passing the overlay dentry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7b279bbf 23-Mar-2021 Dan Carpenter <dan.carpenter@oracle.com>

ovl: fix missing revert_creds() on error path

Smatch complains about missing that the ovl_override_creds() doesn't
have a matching revert_creds() if the dentry is disconnected. Fix this
by moving the ovl_override_creds() until after the disconnected check.

Fixes: aa3ff3c152ff ("ovl: copy up of disconnected dentries")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c7c7a1a1 21-Jan-2021 Tycho Andersen <tycho@tycho.pizza>

xattr: handle idmapped mounts

When interacting with extended attributes the vfs verifies that the
caller is privileged over the inode with which the extended attribute is
associated. For posix access and posix default extended attributes a uid
or gid can be stored on-disk. Let the functions handle posix extended
attributes on idmapped mounts. If the inode is accessed through an
idmapped mount we need to map it according to the mount's user
namespace. Afterwards the checks are identical to non-idmapped mounts.
This has no effect for e.g. security xattrs since they don't store uids
or gids and don't perform permission checks on them like posix acls do.

Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 2f221d6f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 03fedf93 18-Dec-2020 Amir Goldstein <amir73il@gmail.com>

ovl: skip getxattr of security labels

When inode has no listxattr op of its own (e.g. squashfs) vfs_listxattr
calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will
intercept in inode_getxattr hooks.

When selinux LSM is installed but not initialized, it will list the
security.selinux xattr in inode_listsecurity, but will not intercept it
in inode_getxattr. This results in -ENODATA for a getxattr call for an
xattr returned by listxattr.

This situation was manifested as overlayfs failure to copy up lower
files from squashfs when selinux is built-in but not initialized,
because ovl_copy_xattr() iterates the lower inode xattrs by
vfs_listxattr() and vfs_getxattr().

ovl_copy_xattr() skips copy up of security labels that are indentified by
inode_copy_up_xattr LSM hooks, but it does that after vfs_getxattr().
Since we are not going to copy them, skip vfs_getxattr() of the security
labels.

Reported-by: Michael Labriola <michael.d.labriola@gmail.com>
Tested-by: Michael Labriola <michael.d.labriola@gmail.com>
Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.org/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 6939f977 14-Dec-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: do not fail when setting origin xattr

Comment above call already says this, but only EOPNOTSUPP is ignored, other
failures are not.

For example setting "user.*" will fail with EPERM on symlink/special.

Ignore this error as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5830fb6b 13-Oct-2020 Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

ovl: introduce new "uuid=off" option for inodes index feature

This replaces uuid with null in overlayfs file handles and thus relaxes
uuid checks for overlay index feature. It is only possible in case there is
only one filesystem for all the work/upper/lower directories and bare file
handles from this backing filesystem are unique. In other case when we have
multiple filesystems lets just fallback to "uuid=on" which is and
equivalent of how it worked before with all uuid checks.

This is needed when overlayfs is/was mounted in a container with index
enabled (e.g.: to be able to resolve inotify watch file handles on it to
paths in CRIU), and this container is copied and started alongside with the
original one. This way the "copy" container can't have the same uuid on the
superblock and mounting the overlayfs from it later would fail.

That is an example of the problem on top of loop+ext4:

dd if=/dev/zero of=loopbackfile.img bs=100M count=10
losetup -fP loopbackfile.img
losetup -a
#/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
mkfs.ext4 loopbackfile.img
mkdir loop-mp
mount -o loop /dev/loop0 loop-mp
mkdir loop-mp/{lower,upper,work,merged}
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
umount loop-mp/merged
umount loop-mp
e2fsck -f /dev/loop0
tune2fs -U random /dev/loop0

mount -o loop /dev/loop0 loop-mp
mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
#mount: /loop-test/loop-mp/merged:
#mount(2) system call failed: Stale file handle.

If you just change the uuid of the backing filesystem, overlay is not
mounting any more. In Virtuozzo we copy container disks (ploops) when
create the copy of container and we require fs uuid to be unique for a new
container.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1cdb0cb6 13-Oct-2020 Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh

This will be used in next patch to be able to change uuid checks and add
uuid nullification based on ofs->config.index for a new "uuid=off" mode.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 610afc0b 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: pass ovl_fs down to functions accessing private xattrs

This paves the way for optionally using the "user.overlay." xattr
namespace.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 26150ab5 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: drop flags argument from ovl_do_setxattr()

All callers pass zero flags to ovl_do_setxattr(). So drop this argument.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 71097047 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs

Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr()
otherwise.

This has an effect on debug output, which is made more consistent by this
patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# de7a52c9 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: clean up ovl_getxattr() in copy_up.c

Lose the padding and the failure message (in line with other parts of the
copy up process). Return zero for both nonexistent or empty xattr.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fee0f298 02-Sep-2020 Miklos Szeredi <mszeredi@redhat.com>

duplicate ovl_getxattr()

ovl_getattr() returns the value of an xattr in a kmalloced buffer. There
are two callers:

ovl_copy_up_meta_inode_data() (copy_up.c)
ovl_get_redirect_xattr() (util.c)

This patch just copies ovl_getxattr() to copy_up.c, the following patches
will deal with the differences in idividual callers.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c86243b0 31-Aug-2020 Vivek Goyal <vgoyal@redhat.com>

ovl: provide a mount option "volatile"

Container folks are complaining that dnf/yum issues too many sync while
installing packages and this slows down the image build. Build requirement
is such that they don't care if a node goes down while build was still
going on. In that case, they will simply throw away unfinished layer and
start new build. So they don't care about syncing intermediate state to the
disk and hence don't want to pay the price associated with sync.

So they are asking for mount options where they can disable sync on overlay
mount point.

They primarily seem to have two use cases.

- For building images, they will mount overlay with nosync and then sync
upper layer after unmounting overlay and reuse upper as lower for next
layer.

- For running containers, they don't seem to care about syncing upper layer
because if node goes down, they will simply throw away upper layer and
create a fresh one.

So this patch provides a mount option "volatile" which disables all forms
of sync. Now it is caller's responsibility to throw away upper if system
crashes or shuts down and start fresh.

With "volatile", I am seeing roughly 20% speed up in my VM where I am just
installing emacs in an image. Installation time drops from 31 seconds to 25
seconds when nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take out the
network operations latency out of the measurement.

Giuseppe is also looking to cut down on number of iops done on the disk. He
is complaining that often in cloud their VMs are throttled if they cross
the limit. This option can help them where they reduce number of iops (by
cutting down on frequent sync and writebacks).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3f649ab7 03-Jun-2020 Kees Cook <keescook@chromium.org>

treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>


# 5ac8e802 21-Jun-2020 youngjun <her0gyugyu@gmail.com>

ovl: change ovl_copy_up_flags static

"ovl_copy_up_flags" is used in copy_up.c.
so, change it static.

Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 520da69d 26-May-2020 Yuxuan Shui <yshuiv7@gmail.com>

ovl: initialize error in ovl_copy_xattr

In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
xattrs, the copy loop will terminate without assigning anything to the
error variable, thus returning an uninitialized value.

If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
value is put into a pointer by ERR_PTR(), causing potential invalid memory
accesses down the line.

This commit initialize error with 0. This is the correct value because when
there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
should succeed.

This bug is discovered with the help of INIT_STACK_ALL and clang.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
Fixes: 0956254a2d5b ("ovl: don't copy up opaqueness")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 773cb4c5 02-Apr-2020 Amir Goldstein <amir73il@gmail.com>

ovl: prepare to copy up without workdir

With index=on, we copy up lower hardlinks to work dir and move them into
index dir. Fix locking to allow work dir and index dir to be the same
directory.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# c61ca557 17-Mar-2020 Miklos Szeredi <mszeredi@redhat.com>

ovl: ignore failure to copy up unknown xattrs

This issue came up with NFSv4 as the lower layer, which generates
"system.nfs4_acl" xattrs (even for plain old unix permissions). Prior to
this patch this prevented copy-up from succeeding.

The overlayfs permission model mandates that permissions are checked
locally for the task and remotely for the mounter(*). NFS4 ACLs are not
supported by the Linux kernel currently, hence they cannot be enforced
locally. Which means it is indifferent whether this attribute is copied or
not.

Generalize this to any xattr that is not used in access checking (i.e. it's
not a POSIX ACL and not in the "security." namespace).

Incidentally, best effort copying of xattrs seems to also be the behavior
of "cp -a", which is what overlayfs tries to mimic.

(*) Documentation/filesystems/overlayfs.txt#Permission model

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b504c654 01-Nov-2019 Chengguang Xu <cgxu519@mykernel.net>

ovl: improving copy-up efficiency for big sparse file

Current copy-up is not efficient for big sparse file,
It's not only slow but also wasting more disk space
when the target lower file has huge hole inside.
This patch tries to recognize file hole and skip it
during copy-up.

Detail logic of hole detection as below:
When we detect next data position is larger than current
position we will skip that hole, otherwise we copy
data in the size of OVL_COPY_UP_CHUNK_SIZE. Actually,
it may not recognize all kind of holes and sometimes
only skips partial of hole area. However, it will be
enough for most of the use cases.

Additionally, this optimization relies on lseek(2)
SEEK_DATA implementation, so for some specific
filesystems which do not support this feature
will behave as before on copy-up.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1bd0a3ae 16-Dec-2019 lijiazi <jqqlijiazi@gmail.com>

ovl: use pr_fmt auto generate prefix

Use pr_fmt auto generate "overlayfs: " prefix.

Signed-off-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ec7bbb53 15-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: don't use a temp buf for encoding real fh

We can allocate maximum fh size and encode into it directly.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# cbe7fba8 15-Nov-2019 Amir Goldstein <amir73il@gmail.com>

ovl: make sure that real fid is 32bit aligned in memory

Seprate on-disk encoding from in-memory and on-wire resresentation
of overlay file handle.

In-memory and on-wire we only ever pass around pointers to struct
ovl_fh, which encapsulates at offset 3 the on-disk format struct
ovl_fb. struct ovl_fb encapsulates at offset 21 the real file handle.
That makes sure that the real file handle is always 32bit aligned
in-memory when passed down to the underlying filesystem.

On-disk format remains the same and store/load are done into
correctly aligned buffer.

New nfs exported file handles are exported with aligned real fid.
Old nfs file handles are copied to an aligned buffer before being
decoded.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d2912cb1 04-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500

Based on 2 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 253e7483 17-Jun-2019 Nicolas Schier <n.schier@avm.de>

ovl: fix typo in MODULE_PARM_DESC

Change first argument to MODULE_PARM_DESC() calls, that each of them
matched the actual module parameter name. The matching results in
changing (the 'parm' section from) the output of `modinfo overlay` from:

parm: ovl_check_copy_up:Obsolete; does nothing
parm: redirect_max:ushort
parm: ovl_redirect_max:Maximum length of absolute redirect xattr value
parm: redirect_dir:bool
parm: ovl_redirect_dir_def:Default to on or off for the redirect_dir feature
parm: redirect_always_follow:bool
parm: ovl_redirect_always_follow:Follow redirects even if redirect_dir feature is turned off
parm: index:bool
parm: ovl_index_def:Default to on or off for the inodes index feature
parm: nfs_export:bool
parm: ovl_nfs_export_def:Default to on or off for the NFS export feature
parm: xino_auto:bool
parm: ovl_xino_auto_def:Auto enable xino feature
parm: metacopy:bool
parm: ovl_metacopy_def:Default to on or off for the metadata only copy up feature

into:

parm: check_copy_up:Obsolete; does nothing
parm: redirect_max:Maximum length of absolute redirect xattr value (ushort)
parm: redirect_dir:Default to on or off for the redirect_dir feature (bool)
parm: redirect_always_follow:Follow redirects even if redirect_dir feature is turned off (bool)
parm: index:Default to on or off for the inodes index feature (bool)
parm: nfs_export:Default to on or off for the NFS export feature (bool)
parm: xino_auto:Auto enable xino feature (bool)
parm: metacopy:Default to on or off for the metadata only copy up feature (bool)

Signed-off-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3428030d 21-Jan-2019 Amir Goldstein <amir73il@gmail.com>

ovl: fix missing upper fs freeze protection on copy up for ioctl

Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file
with data before FS_IOC_SETFLAGS ioctl.

The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably
caused the confusion. File may be open O_RDONLY, but ioctl modifies the
file. VFS does not call mnt_want_write_file() nor lock inode mutex, but
fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls
mnt_want_write_file() for the overlay file, and fs-specific code calls
mnt_want_write_file() for upper fs file, but there was no call for
ovl_want_write() for copy up duration which prevents overlayfs from copying
up on a frozen upper fs.

Fixes: dab5ca8fd9dd ("ovl: add lsattr/chattr support")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 993a0b2a 30-Jan-2019 Vivek Goyal <vgoyal@redhat.com>

ovl: Do not lose security.capability xattr over metadata file copy-up

If a file has been copied up metadata only, and later data is copied up,
upper loses any security.capability xattr it has (underlying filesystem
clears it as upon file write).

From a user's point of view, this is just a file copy-up and that should
not result in losing security.capability xattr. Hence, before data copy
up, save security.capability xattr (if any) and restore it on upper after
data copy up is complete.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 0c2888749363 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5f32879e 11-Jan-2019 Vivek Goyal <vgoyal@redhat.com>

ovl: During copy up, first copy up data and then xattrs

If a file with capability set (and hence security.capability xattr) is
written kernel clears security.capability xattr. For overlay, during file
copy up if xattrs are copied up first and then data is, copied up. This
means data copy up will result in clearing of security.capability xattr
file on lower has. And this can result into surprises. If a lower file has
CAP_SETUID, then it should not be cleared over copy up (if nothing was
actually written to file).

This also creates problems with chown logic where it first copies up file
and then tries to clear setuid bit. But by that time security.capability
xattr is already gone (due to data copy up), and caller gets -ENODATA.
This has been reported by Giuseppe here.

https://github.com/containers/libpod/issues/2015#issuecomment-447824842

Fix this by copying up data first and then metadta. This is a regression
which has been introduced by my commit as part of metadata only copy up
patches.

TODO: There will be some corner cases where a file is copied up metadata
only and later data copy up happens and that will clear security.capability
xattr. Something needs to be done about that too.

Fixes: bd64e57586d3 ("ovl: During copy up, first copy up metadata and then data")
Cc: <stable@vger.kernel.org> # v4.19+
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 452ce659 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: plumb remap flags through the vfs clone functions

Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 42ec3d4c 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: make remap_file_range functions take and return bytes completed

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 6b52243f 26-Oct-2018 Miklos Szeredi <mszeredi@redhat.com>

ovl: fold copy-up helpers into callers

Now that the workdir and tmpfile copy up modes have been untagled, the
functions become simple enough that the helpers can be folded into the
callers.

Add new helpers where there is any duplication remaining: preparing creds
for creating the object.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b10cdcdc 07-Oct-2018 Amir Goldstein <amir73il@gmail.com>

ovl: untangle copy up call chain

In an attempt to dedup ~100 LOC, we ended up creating a tangled call chain,
whose branches merge and diverge in several points according to the
immutable c->tmpfile copy up mode.

This call chain was hard to analyse for locking correctness because the
locking requirements for the c->tmpfile flow were very different from the
locking requirements for the !c->tmpfile flow (i.e. directory vs. regulare
file copy up).

Split the copy up helpers of the c->tmpfile flow from those of the
!c->tmpfile (i.e. workdir) flow and remove the c->tmpfile mode from copy up
context.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 1f244dc5 26-Oct-2018 Miklos Szeredi <mszeredi@redhat.com>

ovl: clean up error handling in ovl_get_tmpfile()

If security_inode_copy_up() fails, it should not set new_creds, so no need
for the cleanup (which would've Oops-ed anyway, due to old_creds being
NULL).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a725356b 18-Sep-2018 Amir Goldstein <amir73il@gmail.com>

vfs: swap names of {do,vfs}_clone_file_range()

Commit 031a072a0b8a ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.

The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.

It seems that commit 8ede205541ff ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.

Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 989974c8 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Enable metadata only feature

All the bits are in patches before this. So it is time to enable the
metadata only copy up feature.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d1e6f6a9 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: add helper to force data copy-up

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4f93b426 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Copy up meta inode data from lowest data inode

So far lower could not be a meta inode. So whenever it was time to copy up
data of a meta inode, we could copy it up from top most lower dentry.

But now lower itself can be a metacopy inode. That means data copy up
needs to take place from a data inode in metacopy inode chain. Find lower
data inode in the chain and use that for data copy up.

Introduced a helper called ovl_path_lowerdata() to find the lower data
inode chain.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0c288874 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: A new xattr OVL_XATTR_METACOPY for file on upper

Now we will have the capability to have upper inodes which might be only
metadata copy up and data is still on lower inode. So add a new xattr
OVL_XATTR_METACOPY to distinguish between two cases.

Presence of OVL_XATTR_METACOPY reflects that file has been copied up
metadata only and and data will be copied up later from lower origin. So
this xattr is set when a metadata copy takes place and cleared when data
copy takes place.

We also use a bit in ovl_inode->flags to cache OVL_UPPERDATA which reflects
whether ovl inode has data or not (as opposed to metadata only copy up).

If a file is copied up metadata only and later when same file is opened for
WRITE, then data copy up takes place. We copy up data, remove METACOPY
xattr and then set the UPPERDATA flag in ovl_inode->flags. While all these
operations happen with oi->lock held, read side of oi->flags can be
lockless. That is another thread on another cpu can check if UPPERDATA
flag is set or not.

So this gives us an ordering requirement w.r.t UPPERDATA flag. That is, if
another cpu sees UPPERDATA flag set, then it should be guaranteed that
effects of data copy up and remove xattr operations are also visible.

For example.

CPU1 CPU2
ovl_open() acquire(oi->lock)
ovl_open_maybe_copy_up() ovl_copy_up_data()
open_open_need_copy_up() vfs_removexattr()
ovl_already_copied_up()
ovl_dentry_needs_data_copy_up() ovl_set_flag(OVL_UPPERDATA)
ovl_test_flag(OVL_UPPERDATA) release(oi->lock)

Say CPU2 is copying up data and in the end sets UPPERDATA flag. But if
CPU1 perceives the effects of setting UPPERDATA flag but not the effects of
preceding operations (ex. upper that is not fully copied up), it will be a
problem.

Hence this patch introduces smp_wmb() on setting UPPERDATA flag operation
and smp_rmb() on UPPERDATA flag test operation.

May be some other lock or barrier is already covering it. But I am not sure
what that is and is it obvious enough that we will not break it in future.

So hence trying to be safe here and introducing barriers explicitly for
UPPERDATA flag/bit.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2002df85 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Add helper ovl_already_copied_up()

There are couple of places where we need to know if file is already copied
up (in lockless manner). Right now its open coded and there are only two
conditions to check. Soon this patch series will introduce another
condition to check and Amir wants to introduce one more. So introduce a
helper instead to check this so that code is easier to read.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 44d5bf10 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Copy up only metadata during copy up where it makes sense

If it makes sense to copy up only metadata during copy up, do it. This is
done for regular files which are not opened for WRITE.

Right now ->metacopy is set to 0 always. Last patch in the series will
remove the hard coded statement and enable metacopy feature.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# bd64e575 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: During copy up, first copy up metadata and then data

Just a little re-ordering of code. This helps with next patch where after
copying up metadata, we skip data copying step, if needed.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d6eac039 11-May-2018 Vivek Goyal <vgoyal@redhat.com>

ovl: Move the copy up helpers to copy_up.c

Right now two copy up helpers are in inode.c. Amir suggested it might be
better to move these to copy_up.c.

There will one more related function which will come in later patch.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 670c2324 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

ovl: obsolete "check_copy_up" module option

This was provided for debugging the ro/rw inconsistecy. The inconsitency
is now gone so this option is obsolete.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b148cba4 31-May-2018 Miklos Szeredi <mszeredi@redhat.com>

ovl: clean up copy-up error paths

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 137ec526 16-May-2018 Amir Goldstein <amir73il@gmail.com>

ovl: create helper ovl_create_temp()

Also used ovl_create_temp() in ovl_create_index() instead of calling
ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed
through ovl_create_real(), which paves the way for Al's fix for non-hashed
result from vfs_mkdir().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 95a1c815 16-May-2018 Miklos Szeredi <mszeredi@redhat.com>

ovl: return dentry from ovl_create_real()

Al Viro suggested to simplify callers of ovl_create_real() by
returning the created dentry (or ERR_PTR) from ovl_create_real().

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 471ec5dc 16-May-2018 Amir Goldstein <amir73il@gmail.com>

ovl: struct cattr cleanups

* Rename to ovl_cattr

* Fold ovl_create_real() hardlink argument into struct ovl_cattr

* Create macro OVL_CATTR() to initialize struct ovl_cattr from mode

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 6cf00764 16-May-2018 Amir Goldstein <amir73il@gmail.com>

ovl: strip debug argument from ovl_do_ helpers

It did not prove to be useful.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5b2cccd3 02-Feb-2018 Amir Goldstein <amir73il@gmail.com>

ovl: disambiguate ovl_encode_fh()

Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
exportfs function ovl_encode_inode_fh() and change the latter to
ovl_encode_fh() to match the exportfs method name.

Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# aa3ff3c1 15-Oct-2017 Amir Goldstein <amir73il@gmail.com>

ovl: copy up of disconnected dentries

With NFS export, some operations on decoded file handles (e.g. open,
link, setattr, xattr_set) may call copy up with a disconnected non-dir.
In this case, we will copy up lower inode to index dir without
linking it to upper dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 016b720f 11-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: index directories on copy up for NFS export

With the NFS export feature enabled, all dirs are indexed on copy up.
Non-dir files are copied up directly to indexdir and then hardlinked
to upper dir.

Directories are copied up to indexdir, then an index entry is created
in indexdir with 'upper' xattr pointing to the copied up dir and then
the copied up dir is moved to upper dir.

Directory index is also used for consistency verification, like
detecting multiple redirected dirs to the same lower dir on lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 24b33ee1 25-Sep-2017 Amir Goldstein <amir73il@gmail.com>

ovl: create ovl_need_index() helper

The helper determines which lower file needs to be indexed
on copy up and before nlink changes.

For index=on, the helper evaluates to true for lower hardlinks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 05122443 10-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: generalize ovl_verify_origin() and helpers

Remove the "origin" language from the functions that handle set, get
and verify of "origin" xattr and pass the xattr name as an argument.

The same helpers are going to be used for NFS export to get, get and
verify the "upper" xattr for directory index entries.

ovl_verify_origin() is now a helper used only to verify non upper
file handle stored in "origin" xattr of upper inode.

The upper root dir file handle is still stored in "origin" xattr on
the index dir for backward compatibility. This is going to be changed
by the patch that adds directory index entries support.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 9678e630 03-Jan-2018 Amir Goldstein <amir73il@gmail.com>

ovl: fix inconsistent d_ino for legacy merge dir

For a merge dir that was copied up before v4.12 or that was hand crafted
offline (e.g. mkdir {upper/lower}/dir), upper dir does not contain the
'trusted.overlay.origin' xattr. In that case, stat(2) on the merge dir
returns the lower dir st_ino, but getdents(2) returns the upper dir d_ino.

After this change, on merge dir lookup, missing origin xattr on upper
dir will be fixed and 'impure' xattr will be fixed on parent of the legacy
merge dir.

Suggested-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ee023c30 30-Oct-2017 Amir Goldstein <amir73il@gmail.com>

ovl: move include of ovl_entry.h into overlayfs.h

Most overlayfs c files already explicitly include ovl_entry.h
to use overlay entry struct definitions and upcoming changes
are going to require even more c files to include this header.

All overlayfs c files include overlayfs.h and overlayfs.h itself
refers to some structs defined in ovl_entry.h, so it seems more
logic to include ovl_entry.h from overlayfs.h than from c files.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b79e05aa 25-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: no direct iteration for dir with origin xattr

If a non-merge dir in an overlay mount has an overlay.origin xattr, it
means it was once an upper merge dir, which may contain whiteouts and
then the lower dir was removed under it.

Do not iterate real dir directly in this case to avoid exposing whiteouts.

[SzM] Set OVL_WHITEOUT for all merge directories as well.

[amir] A directory that was just copied up does not have the OVL_WHITEOUTS
flag. We need to set it to fix merge dir iteration.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5820dc08 25-Sep-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix missing unlock_rename() in ovl_do_copy_up()

Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0ee931c4 13-Sep-2017 Michal Hocko <mhocko@suse.com>

mm: treewide: remove GFP_TEMPORARY allocation flag

GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation. As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.

The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory. So
this is rather misleading and hard to evaluate for any benefits.

I have checked some random users and none of them has added the flag
with a specific justification. I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring. This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.

I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL. Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.

I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.

This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
seems to be a heuristic without any measured advantage for most (if not
all) its current users. The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers. So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.

[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4439de1 04-Jul-2017 Amir Goldstein <amir73il@gmail.com>

ovl: mark parent impure and restore timestamp on ovl_link_up()

Signed-off-by: Amir Goldstein <amir73il@gmail.com>


# 5f8415d6 20-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: persistent overlay inode nlink for indexed inodes

With inodes index enabled, an overlay inode nlink counts the union of upper
and non-covered lower hardlinks. During the lifetime of a non-pure upper
inode, the following nlink modifying operations can happen:

1. Lower hardlink copy up
2. Upper hardlink created, unlinked or renamed over
3. Lower hardlink whiteout or renamed over

For the first, copy up case, the union nlink does not change, whether the
operation succeeds or fails, but the upper inode nlink may change.
Therefore, before copy up, we store the union nlink value relative to the
lower inode nlink in the index inode xattr trusted.overlay.nlink.

For the second, upper hardlink case, the union nlink should be incremented
or decremented IFF the operation succeeds, aligned with nlink change of the
upper inode. Therefore, before link/unlink/rename, we store the union nlink
value relative to the upper inode nlink in the index inode.

For the last, lower cover up case, we simplify things by preceding the
whiteout or cover up with copy up. This makes sure that there is an index
upper inode where the nlink xattr can be stored before the copied up upper
entry is unlink.

Return the overlay inode nlinks for indexed upper inodes on stat(2).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 59be0971 20-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: implement index dir copy up

Implement a copy up method for non-dir objects using index dir to
prevent breaking lower hardlinks on copy up.

This method requires that the inodes index dir feature was enabled and
that all underlying fs support file handle encoding/decoding.

On the first lower hardlink copy up, upper file is created in index dir,
named after the hex representation of the lower origin inode file handle.
On the second lower hardlink copy up, upper file is found in index dir,
by the same lower handle key.
On either case, the upper indexed inode is then linked to the copy up
upper path.

The index entry remains linked for future lower hardlink copy up and for
lower to upper inode map, that is needed for exporting overlayfs to NFS.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fd210b7d 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move copy up lock out

Move ovl_copy_up_start()/ovl_copy_up_end() out so that it's used for both
tempfile and workdir copy ups.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a6fb235a 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: rearrange copy up

Split up and rearrange copy up functions to make them better readable.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 55acc661 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: add flag for upper in ovl_entry

For rename, we need to ensure that an upper alias exists for hard links
before attempting the operation. Introduce a flag in ovl_entry to track
the state of the upper alias.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 23f0ab13 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: use struct copy_up_ctx as function argument

This cleans up functions with too many arguments.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7ab8b176 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: base tmpfile in workdir too

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 02209d10 19-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: factor out ovl_copy_up_inode() helper

Factor out helper for copying lower inode data and metadata to temp
upper inode, that is common to copy up using O_TMPFILE and workdir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 7d90b853 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: extract helper to get temp file in copy up

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 15932c41 15-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: defer upper dir lock to tempfile link

On copy up of regular file using an O_TMPFILE, lock upper dir only
before linking the tempfile in place.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 54fb347e 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: verify index dir matches upper dir

An index dir contains persistent hardlinks to files in upper dir.
Therefore, we must never mount an existing index dir with a differnt
upper dir.

Store the upper root dir file handle in index dir inode when index
dir is created and verify the file handle before using an existing
index dir on mount.

Add an 'is_upper' flag to the overlay file handle encoding and set it
when encoding the upper root file handle. This is not critical for index
dir verification, but it is good practice towards a standard overlayfs
file handle format for NFS export.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 8b88a2e6 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: verify upper root dir matches lower root dir

When inodes index feature is enabled, verify that the file handle stored
in upper root dir matches the lower root dir or fail to mount.

If upper root dir has no stored file handle, encode and store the lower
root dir file handle in overlay.origin xattr.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 02bcd157 21-Jun-2017 Amir Goldstein <amir73il@gmail.com>

ovl: introduce the inodes index dir feature

Create the index dir on mount. The index dir will contain hardlinks to
upper inodes, named after the hex representation of their origin lower
inodes.

The index dir is going to be used to prevent breaking lower hardlinks
on copy up and to implement overlayfs NFS export.

Because the feature is not fully backward compat, enabling the feature
is opt-in by config/module/mount option.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 09d8b586 04-Jul-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: move __upperdentry to ovl_inode

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fbaf94ee 28-Jun-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't set origin on broken lower hardlink

When copying up a file that has multiple hard links we need to break any
association with the origin file. This makes copy-up be essentially an
atomic replace.

The new file has nothing to do with the old one (except having the same
data and metadata initially), so don't set the overlay.origin attribute.

We can relax this in the future when we are able to index upper object by
origin.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e80 ("ovl: store file handle of lower inode on copy up")


# e85f82ff 28-Jun-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: copy-up: don't unlock between lookup and link

Nothing prevents mischief on upper layer while we are busy copying up the
data.

Move the lookup right before the looked up dentry is actually used.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 01ad3eb8a073 ("ovl: concurrent copy up of regular files")
Cc: <stable@vger.kernel.org> # v4.11


# 01633fd2 17-May-2017 Christoph Hellwig <hch@lst.de>

overlayfs: use uuid_t instead of uuid_be

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>


# 85787090 10-May-2017 Christoph Hellwig <hch@lst.de>

fs: switch ->s_uuid to uuid_t

For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>


# f3a15685 24-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: mark upper merge dir with type origin entries "impure"

An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.

We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.

Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.

This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3d27573c 19-May-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: remove unused arg from ovl_lookup_temp()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 6266d465 18-May-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't fail copy-up if upper doesn't support xattr

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 8137ae26 15-May-2017 Amir Goldstein <amir73il@gmail.com>

ovl: fix creds leak in copy up error path

Fixes: 42f269b92540 ("ovl: rearrange code in ovl_copy_up_locked()")
Cc: <stable@vger.kernel.org> # v4.11
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3a1e819b 30-Mar-2017 Amir Goldstein <amir73il@gmail.com>

ovl: store file handle of lower inode on copy up

Sometimes it is interesting to know if an upper file is pure upper or a
copy up target, and if it is a copy up target, it may be interesting to
find the copy up origin.

This will be used to preserve lower inode numbers across copy up.

Store the lower inode file handle in upper inode extended attribute
overlay.origin on copy up to use it later for these cases. Store the lower
filesystem uuid along side the file handle, so we can validate that we are
looking for the origin file in the original fs.

If lower fs does not support NFS export ops store a zero sized xattr so we
can always use the overlay.origin xattr to distinguish between a copy up
and a pure upper inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};

struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};

The defined bits in request_mask and stx_mask are:

STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5b825c3a 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 01ad3eb8 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

ovl: concurrent copy up of regular files

Now that copy up of regular file is done using O_TMPFILE,
we don't need to hold rename_lock throughout copy up.

Use the copy up waitqueue to synchronize concurrent copy up
of the same file. Different regular files can be copied up
concurrently.

The upper dir inode_lock is taken instead of rename_lock,
because it is needed for lookup and later for linking the
temp file, but it is released while copying up data.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# d8514d8e 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

ovl: copy up regular file using O_TMPFILE

In preparation for concurrent copy up, implement copy up
of regular file as O_TMPFILE that is linked to upperdir
instead of a file in workdir that is moved to upperdir.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 42f269b9 16-Jan-2017 Amir Goldstein <amir73il@gmail.com>

ovl: rearrange code in ovl_copy_up_locked()

As preparation to implementing copy up with O_TMPFILE,
name the variable for dentry before final rename 'temp' and
assign it to 'newdentry' only after rename.

Also lookup upper dentry before looking up temp dentry and
move ovl_set_timestamps() into ovl_copy_up_locked(), because
that is going to be more convenient for upcoming change.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 32a3d848 04-Dec-2016 Al Viro <viro@ZenIV.linux.org.uk>

ovl: clean up kstat usage

FWIW, there's a bit of abuse of struct kstat in overlayfs object
creation paths - for one thing, it ends up with a very small subset
of struct kstat (mode + rdev), for another it also needs link in
case of symlinks and ends up passing it separately.

IMO it would be better to introduce a separate object for that.

In principle, we might even lift that thing into general API and switch
->mkdir()/->mknod()/->symlink() to identical calling conventions. Hell
knows, perhaps ->create() as well...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 9aba6521 12-Nov-2016 Amir Goldstein <amir73il@gmail.com>

ovl: fold ovl_copy_up_truncate() into ovl_copy_up()

This removes code duplication.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5cf5b477 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: opaque cleanup

oe->opaque is set for

a) whiteouts
b) directories having the "trusted.overlay.opaque" xattr

Case b can be simplified, since setting the xattr always implies setting
oe->opaque. Also once set, the opaque flag is never cleared.

Don't need to set opaque flag for non-directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# a6c60655 16-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: redirect on rename-dir

Current code returns EXDEV when a directory would need to be copied up to
move. We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.

This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay. After such attribute has been set,
the directory can be moved without further actions required.

This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2ea98466 23-Sep-2016 Amir Goldstein <amir73il@gmail.com>

ovl: use vfs_clone_file_range() for copy up if possible

When copying up within the same fs, try to use vfs_clone_file_range().
This is very efficient when lower and upper are on the same fs
with file reflink support. If vfs_clone_file_range() fails for any
reason, copy up falls back to the regular data copy code.

Tested correct behavior when lower and upper are on:
1. same ext4 (copy)
2. same xfs + reflink patches + mkfs.xfs (copy)
3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)

For comparison, on my laptop, xfstest overlay/001 (copy up of large
sparse files) takes less than 1 second in the xfs reflink setup vs.
25 seconds on the rest of the setups.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 45063097 04-Dec-2016 Al Viro <viro@zeniv.linux.org.uk>

don't open-code file_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 641089c1 31-Oct-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: fsync after copy-up

Make sure the copied up file hits the disk before renaming to the final
destination. If this is not done then the copy-up may corrupt the data in
the file in case of a crash.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>


# 7764235b 04-Oct-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: use vfs_get_link()

Resulting in a complete removal of a function basically implementing the
inverse of vfs_readlink().

As a bonus, now the proper security hook is also called.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5d6c3191 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

xattr: Add __vfs_{get,set,remove}xattr helpers

Right now, various places in the kernel check for the existence of
getxattr, setxattr, and removexattr inode operations and directly call
those operations. Switch to helper functions and test for the IOP_XATTR
flag instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8eac98b8 06-Sep-2016 Vivek Goyal <vgoyal@redhat.com>

ovl: during copy up, switch to mounter's creds early

Now, we have the notion that copy up of a file is done with the creds
of mounter of overlay filesystem (as opposed to task). Right now before
we switch creds, we do some vfs_getattr() operations in the context of
task and that itself can fail. We should do that getattr() using the
creds of mounter instead.

So this patch switches to mounter's creds early during copy up process so
that even vfs_getattr() is done with mounter's creds.

Do not call revert_creds() unless we have already called
ovl_override_creds(). [Reported by Arnd Bergmann]

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 8b326c61 16-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: copy_up_xattr(): use strnlen

Be defensive about what underlying fs provides us in the returned xattr
list buffer. strlen() may overrun the buffer, so use strnlen() and WARN if
the contents are not properly null terminated.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>


# 121ab822 13-Jul-2016 Vivek Goyal <vgoyal@redhat.com>

security,overlayfs: Provide security hook for copy up of xattrs for overlay file

Provide a security hook which is called when xattrs of a file are being
copied up. This hook is called once for each xattr and LSM can return
0 if the security module wants the xattr to be copied up, 1 if the
security module wants the xattr to be discarded on the copy, -EOPNOTSUPP
if the security module does not handle/manage the xattr, or a -errno
upon an error.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: whitespace cleanup for checkpatch.pl]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# d8ad8b49 13-Jul-2016 Vivek Goyal <vgoyal@redhat.com>

security, overlayfs: provide copy up security hook for unioned files

Provide a security hook to label new file correctly when a file is copied
up from lower layer to upper layer of a overlay/union mount.

This hook can prepare a new set of creds which are suitable for new file
creation during copy up. Caller will use new creds to create file and then
revert back to old creds and release new creds.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: whitespace cleanup to appease checkpatch.pl]
Signed-off-by: Paul Moore <paul@paul-moore.com>


# 0956254a 08-Aug-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: don't copy up opaqueness

When a copy up of a directory occurs which has the opaque xattr set, the
xattr remains in the upper directory. The immediate behavior with overlayfs
is that the upper directory is not treated as opaque, however after a
remount the opaque flag is used and upper directory is treated as opaque.
This causes files created in the lower layer to be hidden when using
multiple lower directories.

Fix by not copying up the opaque flag.

To reproduce:

----8<---------8<---------8<---------8<---------8<---------8<----
mkdir -p l/d/s u v w mnt
mount -t overlay overlay -olowerdir=l,upperdir=u,workdir=w mnt
rm -rf mnt/d/
mkdir -p mnt/d/n
umount mnt
mount -t overlay overlay -olowerdir=u:l,upperdir=v,workdir=w mnt
touch mnt/d/foo
umount mnt
mount -t overlay overlay -olowerdir=u:l,upperdir=v,workdir=w mnt
ls mnt/d
----8<---------8<---------8<---------8<---------8<---------8<----

output should be: "foo n"

Reported-by: Derek McGowan <dmcg@drizz.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151291
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>


# 39b681f8 28-Jul-2016 Miklos Szeredi <mszeredi@redhat.com>

ovl: store real inode pointer in ->i_private

To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.

Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.

Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 3fe6e52f 07-Apr-2016 Antonio Murdaca <amurdaca@redhat.com>

ovl: override creds with the ones from the superblock mounter

In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*

This ends up failing with -EPERM after the files in dir has been
correctly deleted:

unlinkat(4, "2", 0) = 0
unlinkat(4, "1", 0) = 0
unlinkat(4, "3", 0) = 0
close(4) = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace. The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# f134f244 15-Mar-2016 Sohom Bhattacharjee <soham.bhattacharjee15@gmail.com>

ovl: fixed coding style warning

This patch fixes a newline warning found by the checkpatch.pl tool

Signed-off-by: Sohom-Bhattacharjee <soham.bhattacharjee15@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# fb5bb2c3 07-Jul-2015 David Howells <dhowells@redhat.com>

ovl: Warn on copy up if a process has a R/O fd open to the lower file

Print a warning when overlayfs copies up a file if the process that
triggered the copy up has a R/O fd open to the lower file being copied up.

This can help catch applications that do things like the following:

fd1 = open("foo", O_RDONLY);
fd2 = open("foo", O_RDWR);

where they expect fd1 and fd2 to refer to the same file - which will no
longer be the case post-copy up.

With this patch, the following commands:

bash 5</mnt/a/foo128
6<>/mnt/a/foo128

assuming /mnt/a/foo128 to be an un-copied up file on an overlay will
produce the following warning in the kernel log:

overlayfs: Copying up foo129, but open R/O on fd 5 which will cease
to be coherent [pid=3818 bash]

This is enabled by setting:

/sys/module/overlay/parameters/check_copy_up

to 1.

The warnings are ratelimited and are also limited to one warning per file -
assuming the copy up completes in each case.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5955102c 22-Jan-2016 Al Viro <viro@zeniv.linux.org.uk>

wrappers for ->i_mutex access

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0f7ff2da 05-Dec-2015 Al Viro <viro@zeniv.linux.org.uk>

ovl: get rid of the dead code left from broken (and disabled) optimizations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e4ad29fa 24-Oct-2015 Vito Caputo <vito.caputo@coreos.com>

ovl: use a minimal buffer in ovl_copy_xattr

Rather than always allocating the high-order XATTR_SIZE_MAX buffer
which is costly and prone to failure, only allocate what is needed and
realloc if necessary.

Fixes https://github.com/coreos/bugs/issues/489

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>


# 97daf8b9 10-Nov-2015 Miklos Szeredi <miklos@szeredi.hu>

ovl: allow zero size xattr

When ovl_copy_xattr() encountered a zero size xattr no more xattrs were
copied and the function returned success. This is clearly not the desired
behavior.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>


# ab79efab 18-Sep-2015 David Howells <dhowells@redhat.com>

ovl: fix dentry reference leak

In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just to out after calling ovl_cleanup() - which doesn't
actually release the ref on newdentry.

The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.

Without this fix, something like the following is seen:

BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs]
BUG: Dentry ffff880023ece640{i=0,n=bigfile} still in use (1) [unmount of tmpfs tmpfs]

when unmounting the upper layer after an error occurred in copyup.

An error can be induced by creating a big file in a lower layer with
something like:

dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

to create a large file (4.1G). Overlay an upper layer that is too small
(on tmpfs might do) and then induce a copy up by opening it writably.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+


# 0480334f 18-Sep-2015 David Howells <dhowells@redhat.com>

ovl: use O_LARGEFILE in ovl_copy_up()

Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+


# cc6f67bc 19-May-2015 Miklos Szeredi <mszeredi@suse.cz>

ovl: mount read-only if workdir can't be created

OpenWRT folks reported that overlayfs fails to mount if upper fs is full,
because workdir can't be created. Wordir creation can fail for various
other reasons too.

There's no reason that the mount itself should fail, overlayfs can work
fine without a workdir, as long as the overlay isn't modified.

So mount it read-only and don't allow remounting read-write.

Add a couple of WARN_ON()s for the impossible case of workdir being used
despite being read-only.

Reported-by: Bastian Bittorf <bittorf@bluebottle.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.18+


# 1ba38725 26-Nov-2014 hujianyang <hujianyang@huawei.com>

ovl: Cleanup redundant blank lines

This patch removes redundant blanks lines in overlayfs.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 1afaba1e 12-Dec-2014 Miklos Szeredi <mszeredi@suse.cz>

ovl: make path-type a bitmap

OVL_PATH_PURE_UPPER -> __OVL_PATH_UPPER | __OVL_PATH_PURE
OVL_PATH_UPPER -> __OVL_PATH_UPPER
OVL_PATH_MERGE -> __OVL_PATH_UPPER | __OVL_PATH_MERGE
OVL_PATH_LOWER -> 0

Multiple R/O layers will allow __OVL_PATH_MERGE without __OVL_PATH_UPPER.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# e9be9d5e 23-Oct-2014 Miklos Szeredi <mszeredi@suse.cz>

overlay filesystem

Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay

The following cotributions have been folded into this patch:

Neil Brown <neilb@suse.de>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <nbd@openwrt.org>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts

Erez Zadok <ezk@fsl.cs.sunysb.edu>
- fix cleanup after WARN_ON

Sedat Dilek <sedat.dilek@googlemail.com>
- fix up permission to confirm to new API

Robin Dong <hao.bigrat@gmail.com>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link

Andy Whitcroft <apw@canonical.com>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode

AV:
- ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
- ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
lack of check for udir being the parent of upper, dropping and regaining
the lock on udir (which would require _another_ check for parent being
right).
- bogus d_drop() in copyup and rename [fix from your mail]
- copyup/remove and copyup/rename races [fix from your mail]
- ovl_dir_fsync() leaving ERR_PTR() in ->realfile
- ovl_entry_free() is pointless - it's just a kfree_rcu()
- fold ovl_do_lookup() into ovl_lookup()
- manually assigning ->d_op is wrong. Just use ->s_d_op.
[patches picked from Miklos]:
* copyup/remove and copyup/rename races
* bogus d_drop() in copyup and rename

Also thanks to the following people for testing and reporting bugs:

Jordi Pujol <jordipujolp@gmail.com>
Andy Whitcroft <apw@canonical.com>
Michal Suchanek <hramrach@centrum.cz>
Felix Fietkau <nbd@openwrt.org>
Erez Zadok <ezk@fsl.cs.sunysb.edu>
Randy Dunlap <rdunlap@xenotime.net>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>