History log of /linux-master/fs/ceph/inode.c
Revision Date Author Comments
# bbb20ea9 17-Jan-2024 Xiubo Li <xiubli@redhat.com>

ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT

The fscrypt code uses i_blkbits to set up ci_data_unit_bits when
allocating a new inode, but ceph only initializes i_blkbits later, when
filling the inode, which is too late. Since ci_data_unit_bits is only
used by the fscrypt framework, initializing i_blkbits to
CEPH_FSCRYPT_BLOCK_SHIFT is safe.

Link: https://tracker.ceph.com/issues/64035
Fixes: 5b1188847180 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 100ccd18 24-Nov-2023 David Howells <dhowells@redhat.com>

netfs: Optimise away reads above the point at which there can be no data

Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.

Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:

(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.

(2) On local inode creation, we set zero_point to 0.

(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.

(4) On local truncation up, we don't change zero_point.

(5) On local modification, we don't change zero_point.

(6) On remote invalidation, we set zero_point to the new i_size.

(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.

(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.

(9) If a direct I/O write is made, set zero_point above that.

Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.

The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
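
The rules above can be modelled in a few lines of user-space C. This is
only an illustrative sketch: struct zp_inode and the zp_*() helpers are
hypothetical names and are not the netfs API; only the zero_point and
i_size names come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-inode state described above. */
struct zp_inode {
	uint64_t i_size;	/* current file size */
	uint64_t zero_point;	/* server has no data at or above this offset */
};

/* Rules (1) and (6): remote inode seen with no cache, or remote invalidation. */
static void zp_set_remote(struct zp_inode *ino, uint64_t remote_size)
{
	ino->i_size = remote_size;
	ino->zero_point = remote_size;
}

/* Rule (2): inode created locally, so the server holds no data yet. */
static void zp_init_created(struct zp_inode *ino)
{
	ino->i_size = 0;
	ino->zero_point = 0;
}

/* Rules (3) and (4): only a truncation down may lower zero_point. */
static void zp_truncate(struct zp_inode *ino, uint64_t new_size)
{
	if (new_size < ino->i_size && new_size < ino->zero_point)
		ino->zero_point = new_size;
	ino->i_size = new_size;
}

/* Rules (7)-(9): data up to 'end' now exists on the server. */
static void zp_note_server_data(struct zp_inode *ino, uint64_t end)
{
	if (end > ino->zero_point)
		ino->zero_point = end;
}

/*
 * How many bytes of [pos, pos + len) must be fetched from the server.
 * The remainder, at or above zero_point, can be zero-filled locally.
 * A request that straddles zero_point is split at that offset.
 */
static uint64_t zp_bytes_to_download(const struct zp_inode *ino,
				     uint64_t pos, uint64_t len)
{
	assert(len <= UINT64_MAX - pos);	/* pos + len must not wrap */
	if (pos >= ino->zero_point)
		return 0;
	if (pos + len <= ino->zero_point)
		return len;
	return ino->zero_point - pos;
}
```

In this model a read that starts at or above zero_point never touches
the server, and a straddling read only downloads the part below it.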

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org


# c9c4ff12 27-Nov-2023 David Howells <dhowells@redhat.com>

netfs: Move pinning-for-writeback from fscache to netfs

Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.

Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.

Note that this mechanism could be more generically useful than that for
network filesystems. Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org


# a04aff25 07-Aug-2023 Christian Brauner <brauner@kernel.org>

ceph: allow idmapped setattr inode op

Enable __ceph_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

[ aleksandr.mikhalitsyn: adapted to b27c82e12965 ("attr: port attribute
changes to new types") ]

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 79c66a0c 07-Aug-2023 Alexander Mikhalitsyn <alexander@mihalicyn.com>

ceph: pass idmap to __ceph_setattr

Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8995375f 07-Aug-2023 Christian Brauner <brauner@kernel.org>

ceph: allow idmapped permission inode op

Enable ceph_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0513043e 07-Aug-2023 Christian Brauner <brauner@kernel.org>

ceph: allow idmapped getattr inode op

Enable ceph_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's idmapping.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 38d46409 11-Jun-2023 Xiubo Li <xiubli@redhat.com>

ceph: print cluster fsid and client global_id in all debug logs

Multiple CephFS mounts on a host are increasingly common, so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same time, this improves the debug logs to make troubleshooting
easier, for example by printing the ino# instead of only the memory
address of the corresponding inode, and by printing dentry names
instead of the memory addresses of the dentries.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5995d90d 11-Jun-2023 Xiubo Li <xiubli@redhat.com>

ceph: rename _to_client() to _to_fs_client()

We need to convert the inode to a ceph_client in the following commit
and will add a new helper for that, so rename the old helper to
_to_fs_client() here.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c453bdb5 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

ceph: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-22-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 15c0a870 06-Sep-2023 Xiubo Li <xiubli@redhat.com>

ceph: fix incorrect revoked caps assert in ceph_fill_file_size()

When truncating the inode, the MDS acquires the xlock for the
ifile Locker, which revokes the 'Frwsxl' caps from the clients.
But when a client releases and flushes just the 'Fw' caps to the MDS,
for example, the MDS considers the revocation finished as soon as it
receives the cap flush message. The MDS then continues truncating the
inode and issues the truncate notification to all the clients. If a
client receives the truncate notification just before it receives the
cap flush ack, it will detect that 'issued | dirty' still holds the
'Fw' caps.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56693
Fixes: b0d7c2231015 ("ceph: introduce i_truncate_mutex")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 295fc4aa 06-Mar-2023 Xiubo Li <xiubli@redhat.com>

ceph: fix updating i_truncate_pagecache_size for fscrypt

When fscrypt is enabled we always align the truncate size up to
CEPH_FSCRYPT_BLOCK_SIZE, so if we truncate the size within the same
block more than once, the later truncates skip invalidating the
page cache.

Force the page cache invalidation by using a size smaller than the
real file size.
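
A small user-space sketch of the alignment shows why repeated truncates
within one block look identical. It assumes a 4096-byte
CEPH_FSCRYPT_BLOCK_SIZE, and fscrypt_round_up() is a hypothetical helper
name; this is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration; the kernel defines its own constant. */
#define CEPH_FSCRYPT_BLOCK_SIZE 4096ULL

/* Round a size up to the next crypto block boundary (power-of-two block). */
static uint64_t fscrypt_round_up(uint64_t size)
{
	assert(size <= UINT64_MAX - (CEPH_FSCRYPT_BLOCK_SIZE - 1));
	return (size + CEPH_FSCRYPT_BLOCK_SIZE - 1) &
	       ~(CEPH_FSCRYPT_BLOCK_SIZE - 1);
}
```

Under that assumption, truncating to 5000 and then to 4500 bytes gives
the same aligned size both times, so a check that compares only aligned
sizes sees no change on the second truncate.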

At the same time add more debug log and fix the debug log for
truncate code.

Link: https://tracker.ceph.com/issues/58834
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dd66df00 25-Aug-2022 Luís Henriques <lhenriques@suse.de>

ceph: add support for encrypted snapshot names

Since filenames in encrypted directories are encrypted and shown as
a base64-encoded string when the directory is locked, make snapshot
names show a similar behaviour.

When creating a snapshot, .snap directories for every subdirectory will
show the snapshot name in the "long format":

  # mkdir .snap/my-snap
  # ls my-dir/.snap/
  _my-snap_1099511627782

Encrypted snapshots will need to be able to handle these by
encrypting/decrypting only the snapshot part of the string ('my-snap').
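
As a rough illustration of isolating the snapshot part, the following
user-space sketch splits a long-format name of the form
"_<snap-name>_<inode-number>". The helper name snap_name_part() is
hypothetical, and treating the last underscore as the separator is an
assumption made here so that snapshot names may themselves contain
underscores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: locate the snapshot-name part of a long-format
 * name. Returns a pointer into 'name' and stores the part's length in
 * *len, or returns NULL if 'name' is not in the long format.
 */
static const char *snap_name_part(const char *name, size_t *len)
{
	const char *last;

	assert(name != NULL && len != NULL);
	if (name[0] != '_')
		return NULL;
	/* Separator in front of the inode number. */
	last = strrchr(name + 1, '_');
	if (last == NULL || last == name + 1 || last[1] == '\0')
		return NULL;
	*len = (size_t)(last - (name + 1));
	return name + 1;
}
```

Only the bytes covered by the returned pointer and length would be
passed to encryption or decryption; the leading underscore and the
trailing inode number stay as they are.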

Also, since the MDS prevents snapshot names from being longer than 240
characters, it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate
this extra limitation.

[ idryomov: drop const on !CONFIG_FS_ENCRYPTION branch too ]

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5c64737d 25-Aug-2022 Xiubo Li <xiubli@redhat.com>

ceph: add truncate size handling support for fscrypt

This will transfer the encrypted last block contents to the MDS
along with the truncate request only when the new size is smaller
and not aligned to the fscrypt BLOCK size. When the last block is
located in the file hole, the truncate request will only contain
the header.

The MDS could fail to do the truncate if another client or process
has already updated the RADOS object which contains the last block,
in which case it returns -EAGAIN and the kclient needs to retry.
The RMW takes around 50ms, and the kclient will retry up to 20 times
for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 16be62fc 25-Aug-2022 Jeff Layton <jlayton@kernel.org>

ceph: size handling in MClientRequest, cap updates and inode traces

For encrypted inodes, transmit a rounded-up size to the MDS as the
normal file size and send the real inode size in fscrypt_file field.
Also, fix up creates and truncates to also transmit fscrypt_file.

When we get an inode trace from the MDS, grab the fscrypt_file field if
the inode is encrypted, and use it to populate the i_size field instead
of the regular inode size field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e127e030 03-Nov-2022 Luís Henriques <lhenriques@suse.de>

ceph: allow encrypting a directory while not having Ax caps

If a client doesn't have Ax caps on a directory, it will get errors while
trying to encrypt it:

ceph: handle_cap_grant: cap grant attempt to change fscrypt_auth on non-I_NEW inode (old len 0 new len 48)
fscrypt (ceph, inode 1099511627812): Error -105 getting encryption context

A simple way to reproduce this is to use two clients:

client1 # mkdir /mnt/mydir

client2 # ls /mnt/mydir

client1 # fscrypt encrypt /mnt/mydir
client1 # echo hello > /mnt/mydir/world

This happens because, in __ceph_setattr(), we only initialize
ci->fscrypt_auth if we have Ax and ceph_fill_inode() won't use the
fscrypt_auth received if the inode state isn't I_NEW. Fix it by allowing
ceph_fill_inode() to also set ci->fscrypt_auth if the inode doesn't have
it set already.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 94af0470 01-Jul-2021 Jeff Layton <jlayton@kernel.org>

ceph: add some fscrypt guardrails

Add the appropriate calls into fscrypt for various actions, including
link, rename, setattr, and the open codepaths.

Disable fallocate for encrypted inodes -- hopefully, just for now.

If we have an encrypted inode, then the client will need to re-encrypt
the contents of the new object. Disable copy offload to or from
encrypted inodes.

Set i_blkbits to crypto block size for encrypted inodes -- some of the
underlying infrastructure for fscrypt relies on i_blkbits being aligned
to crypto blocksize.

Report STATX_ATTR_ENCRYPTED on encrypted inodes.

[ lhenriques: forbid encryption with striped layouts ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 79f2f6ad 03-Sep-2020 Jeff Layton <jlayton@kernel.org>

ceph: create symlinks with encrypted and base64-encoded targets

When creating symlinks in encrypted directories, encrypt and
base64-encode the target with the new inode's key before sending to the
MDS.

When filling a symlinked inode, base64-decode it into a buffer that
we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer
into a new one that will hang off i_link.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# af9ffa6d 13-Mar-2022 Xiubo Li <xiubli@redhat.com>

ceph: add support to readdir for encrypted names

To make it simpler to decrypt names in a readdir reply (i.e. before
we have a dentry), add a new ceph_encode_encrypted_fname()-like helper
that takes a qstr pointer instead of a dentry pointer.

Once we've decrypted the names in a readdir reply, we no longer need the
ciphertext, so overwrite them in ceph_mds_reply_dir_entry with the
unencrypted names. Then in both ceph_readdir_prepopulate() and
ceph_readdir() we will use the decrypted name directly.

[ jlayton: convert some BUG_ONs into error returns ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 85529096 10-Aug-2020 Jeff Layton <jlayton@kernel.org>

ceph: make ceph_fill_trace and ceph_get_name decrypt names

When we get a dentry in a trace, decrypt the name so we can properly
instantiate the dentry or fill out ceph_get_name() buffer.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6b5717bd 08-Sep-2020 Jeff Layton <jlayton@kernel.org>

ceph: implement -o test_dummy_encryption mount option

Add support for the test_dummy_encryption mount option. This allows us
to test the encrypted codepaths in ceph without having to manually set
keys, etc.

[ lhenriques: fix potential fsc->fsc_dummy_enc_policy memory leak in
ceph_real_mount() ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2d332d5b 27-Jul-2020 Jeff Layton <jlayton@kernel.org>

ceph: fscrypt_auth handling for ceph

Most fscrypt-enabled filesystems store the crypto context in an xattr,
but that's problematic for ceph as xattrs are governed by the XATTR cap,
whereas we really want the crypto context to be part of the AUTH cap.

Because of this, the MDS has added two new inode metadata fields:
fscrypt_auth and fscrypt_file. The former is used to hold the crypto
context, and the latter is used to track the real file size.

Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now,
we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt
context.

Allow the client to use a setattr request for setting the fscrypt_auth
field. Since this is not a standard setattr request from the VFS, we add
a new field to __ceph_setattr that carries ceph-specific inode attrs.

Have the set_context op do a setattr that sets the fscrypt_auth value,
and get_context just return the contents of that field (since it should
always be available).

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ec9595c0 26-Aug-2020 Jeff Layton <jlayton@kernel.org>

ceph: preallocate inode for ops that may create one

When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.

Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.

In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).

The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0d72b928 07-Aug-2023 Jeff Layton <jlayton@kernel.org>

fs: pass the request_mask to generic_fillattr

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 7795aef0 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

ceph: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-28-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# f6102994 05-Aug-2022 Jeff Layton <jlayton@kernel.org>

ceph: report the inode version in getattr if requested

When getattr requests the STATX_CHANGE_COOKIE, request the full gamut of
caps (similarly to how ctime is handled). When the change attribute
seems to be valid, return it in the change_cookie field and set the flag
in the reply mask. Also, unconditionally enable
STATX_ATTR_CHANGE_MONOTONIC.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 13e83a49 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->set_acl() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# b74d24f7 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->getattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1632a0f 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->setattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# e4b731cc 17-Oct-2022 Xiubo Li <xiubli@redhat.com>

ceph: remove useless session parameter for check_caps()

The session parameter makes no sense any more.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>


# cac2f8b8 22-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: rename current get acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in its current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

  acl_permission_check()
  -> check_acl()
     -> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 138060ba 23-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: pass dentry to set acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in its current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystems rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs), they cannot rely on the set acl
inode operation. But since ->set_acl() is required in order to use the
generic posix acl xattr handlers, filesystems that do not implement this
inode operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# f86a4866 13-Oct-2022 Dan Carpenter <error27@gmail.com>

ceph: fix a NULL vs IS_ERR() check when calling ceph_lookup_inode()

The ceph_lookup_inode() function returns error pointers. It never
returns NULL.
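
The error-pointer convention behind this fix can be illustrated with a
user-space re-creation of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
helpers. The definitions below are simplified stand-ins written for this
sketch, and lookup_thing() is a hypothetical function that, like
ceph_lookup_inode(), reports failure with an error pointer rather than
NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space stand-ins for the kernel helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: returns an error pointer on failure, never NULL. */
static void *lookup_thing(int id)
{
	static int thing = 42;

	assert(id >= 0);
	if (id != 1)
		return ERR_PTR(-ENOENT);
	return &thing;
}
```

A caller that tests the result with "if (!ptr)" never sees the failure,
because the error pointer is non-NULL; the check has to be IS_ERR(ptr),
with PTR_ERR(ptr) recovering the errno value.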

Fixes: aa87052dd965 ("ceph: fix incorrectly showing the .snap size for stat")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 81895a65 05-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use prandom_u32_max() when possible, part 1

Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
    value = int(literal, 16)
elif literal[0] in '123456789':
    value = int(literal, 10)
if value is None:
    print("I don't know how to handle %s" % (literal))
    cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
    print("Skipping 0x%x for cleanup elsewhere" % (value))
    cocci.include_match(False)
elif value & (value + 1) != 0:
    print("Skipping 0x%x because it's not a power of two minus one" % (value))
    cocci.include_match(False)
elif literal.startswith('0x'):
    coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
    coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}

@drop_var@
type T;
identifier VAR;
@@

{
- T VAR;
... when != VAR
}

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
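
The division-free bounding that this commit message refers to can be
sketched in plain C. This is a minimal illustration of the
multiply-and-shift pattern that appears in the @basic@ rule
("(u64)(E) * get_random_u32() >> 32"), not the kernel's implementation.
The helper names are hypothetical, and the random word is passed in by
the caller so that the behaviour is deterministic.

```c
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value onto [0, ceil) with one
 * multiplication and one shift instead of a modulo (division).
 * Hypothetical helper name; the caller supplies the random word.
 */
static inline uint32_t scale_u32_to_range(uint32_t rnd, uint32_t ceil)
{
	return (uint32_t)(((uint64_t)ceil * rnd) >> 32);
}

/* For comparison: mask-based bounding, valid only for power-of-two ceil. */
static inline uint32_t mask_u32_to_range(uint32_t rnd, uint32_t ceil_pow2)
{
	return rnd & (ceil_pow2 - 1);
}
```

The scaled result is always below ceil. When ceil is not a power of two
the mapping carries a small bias, because 2^32 input values cannot be
spread evenly over ceil outputs.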


# aa87052d 30-Aug-2022 Xiubo Li <xiubli@redhat.com>

ceph: fix incorrectly showing the .snap size for stat

We should set the 'stat->size' to the real number of snapshots for
snapdirs.

Link: https://tracker.ceph.com/issues/57342
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b4b924c7 24-Aug-2022 Jeff Layton <jlayton@kernel.org>

ceph: increment i_version when doing a setattr with caps

When the client has enough caps to satisfy a setattr locally without
having to talk to the server, we currently do the setattr without
incrementing the change attribute.

Ensure that if the ctime changes locally, then the change attribute
does too.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6eb06c46 26-Jul-2022 Xiubo Li <xiubli@redhat.com>

ceph: fail the request if the peer MDS doesn't support getvxattr op

Just fail the request instead of sending it out; otherwise the peer
MDS will crash.

Link: https://tracker.ceph.com/issues/56529
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 48490776 06-Jun-2022 Xiubo Li <xiubli@redhat.com>

ceph: don't get the inline data for new creating files

If 'i_inline_version' is 1, the file has just been created and
should not have any inline data in it, so we should skip retrieving
the inline data from the MDS.

This could also help reduce the possibility of the deadlock issue
introduced by the inline data and Fcr caps.

We will gradually remove the inline feature from kclient once ceph's
scrub tool has support for uninlining the inline data; for now this
could help reduce the teuthology test failures.

This could possibly also fix a bug where, for some old clients that
couldn't explicitly uninline the inline data when writing, the
inline version always stays at 1. We might then always be reading
nonexistent data from the inline data.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8266c4d7 26-May-2022 Xiubo Li <xiubli@redhat.com>

ceph: choose auth MDS for getxattr with the Xs caps

For getxattr with the 'Xs' caps we will also choose the auth MDS,
because the MDS-side code is buggy: setxattr doesn't notify the
replica MDSes when the values change, so a replica MDS will return
the old values. We will fix this in the MDS code, but this still
makes sense for old ceph.

Link: https://tracker.ceph.com/issues/55331
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e81fb419 09-Jun-2022 Linus Torvalds <torvalds@linux-foundation.org>

netfs: Further cleanups after struct netfs_inode wrapper introduced

Change the signature of netfs helper functions to take a struct netfs_inode
pointer rather than a struct inode pointer where appropriate, thereby
relieving the need for the network filesystem to convert its internal inode
format down to the VFS inode only for netfslib to bounce it back up. For
type safety, it's better not to do that (and it's less typing too).

Give netfs_write_begin() an extra argument to pass in a pointer to the
netfs_inode struct rather than deriving it internally from the file
pointer. Note that the ->write_begin() and ->write_end() ops are intended
to be replaced in the future by netfslib code that manages this without the
need to call in twice for each page.

netfs_readpage() and similar are intended to be pointed at directly by the
address_space_operations table, so must stick to the signature dictated by
the function pointers there.

Changes
=======
- Updated the kerneldoc comments and documentation [DH].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/


# 874c8ca1 09-Jun-2022 David Howells <dhowells@redhat.com>

netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename netfs_i_context to netfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
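
The embedding approach described in this commit can be sketched in
userspace C. This is a minimal illustration under stated assumptions:
the structures are hypothetical, heavily reduced stand-ins for struct
inode, struct netfs_inode and a filesystem's inode wrapper, and
container_of() is a simplified version of the kernel macro.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, heavily reduced versions of the real structures. */
struct inode_sketch {
	unsigned long i_ino;
};

struct netfs_inode_sketch {
	struct inode_sketch inode;	/* embedded VFS inode */
	void *cache;			/* stand-in for the cache cookie */
};

struct fs_inode_sketch {
	struct netfs_inode_sketch netfs;	/* "netfs" member name */
	int fs_private;
};

/* VFS inode -> netfs context, with no assumption about adjacent layout. */
static inline struct netfs_inode_sketch *
netfs_inode_sketch(struct inode_sketch *inode)
{
	return container_of(inode, struct netfs_inode_sketch, inode);
}

/* VFS inode -> filesystem wrapper, going through both embeddings. */
static inline struct fs_inode_sketch *
fs_inode_sketch(struct inode_sketch *inode)
{
	return container_of(inode, struct fs_inode_sketch, netfs.inode);
}
```

Because the conversion goes through offsetof() on a named member, the
compiler knows the bounds of the enclosing object, and the result stays
correct even if the members are reordered.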


# 507160f4 09-Jun-2022 Linus Torvalds <torvalds@linux-foundation.org>

netfs: gcc-12: temporarily disable '-Wattribute-warning' for now

This is a pure band-aid so that I can continue merging stuff from people
while some of the gcc-12 fallout gets sorted out.

In particular, gcc-12 is very unhappy about the kinds of pointer
arithmetic tricks that netfs does, and that makes the fortify checks
trigger in afs and ceph:

In function ‘fortify_memset_chk’,
inlined from ‘netfs_i_context_init’ at include/linux/netfs.h:327:2,
inlined from ‘afs_set_netfs_context’ at fs/afs/inode.c:61:2,
inlined from ‘afs_root_iget’ at fs/afs/inode.c:543:2:
include/linux/fortify-string.h:258:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
258 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and the reason is that netfs_i_context_init() is passed a 'struct inode'
pointer, and then it does

struct netfs_i_context *ctx = netfs_i_context(inode);

memset(ctx, 0, sizeof(*ctx));

where that netfs_i_context() function just does pointer arithmetic on
the inode pointer, knowing that the netfs_i_context is laid out
immediately after it in memory.

This is all truly disgusting, since the whole "netfs_i_context is laid
out immediately after it in memory" is not actually remotely true in
general, but is just made to be that way for afs and ceph.

See for example fs/cifs/cifsglob.h:

struct cifsInodeInfo {
	struct {
		/* These must be contiguous */
		struct inode	vfs_inode;	/* the VFS's inode record */
		struct netfs_i_context netfs_ctx; /* Netfslib context */
	};
	[...]

and realize that this is all entirely wrong, and the pointer arithmetic
that netfs_i_context() is doing is also very very wrong and wouldn't
give the right answer if netfs_ctx had different alignment rules from a
'struct inode', for example.

Anyway, that's just a long-winded way to say "the gcc-12 warning is
actually quite reasonable, and our code happens to work but is pretty
disgusting".

This is getting fixed properly, but for now I made the mistake of
thinking "the week right after the merge window tends to be calm for me
as people take a breather" and I did a system upgrade. And I got gcc-12
as a result, so to continue merging fixes from people and not have the
end result drown in warnings, I am fixing all these gcc-12 issues I hit.

Including with these kinds of temporary fixes.

Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/all/AEEBCF5D-8402-441D-940B-105AA718C71F@chromium.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55ab5520 27-Apr-2022 Luís Henriques <lhenriques@suse.de>

ceph: fix statfs for subdir mounts

When mounting with a base directory that has a 'max_bytes' quota set,
statfs uses that value as the total; if a subdirectory is used instead,
statfs uses the same 'max_bytes' too, unless there is another quota set.

Unfortunately, if this subdirectory only has the 'max_files' quota set,
then statfs uses the filesystem total. Fix this by making sure we only
lookup realms that contain the 'max_bytes' quota.

Cc: Ryan Taylor <rptaylor@uvic.ca>
URL: https://tracker.ceph.com/issues/55090
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5eed80fb 20-Apr-2022 Xiubo Li <xiubli@redhat.com>

ceph: try to choose the auth MDS if possible for getattr

If any 'x' caps are issued we can just choose the auth MDS instead
of a random replica MDS, because the loner client can only get the
'x' caps when the Locker is in the LOCK_EXEC state. If we send the
getattr request to a replica MDS, it must auth pin and try to rdlock
from the auth MDS, and then the auth MDS needs to do the Locker state
transition to LOCK_SYNC. After that the lock state will change back.

The Locker state transition is costly and will usually need to revoke
caps from clients.

URL: https://tracker.ceph.com/issues/55240
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 261998c3 10-Apr-2022 Xiubo Li <xiubli@redhat.com>

ceph: fix statx AT_STATX_DONT_SYNC vs AT_STATX_FORCE_SYNC check

According to POSIX and the comments in the initial statx support
commit, AT_STATX_DONT_SYNC is a lightweight stat and
AT_STATX_FORCE_SYNC is a heavier-weight one. All the other current
users of these two flags behave the same way: they skip synchronously
retrieving the attributes from storage only when AT_STATX_FORCE_SYNC
is not set and AT_STATX_DONT_SYNC is set.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
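
The flag check described in this commit can be sketched as a small
predicate. This is an illustration only, not the ceph code: the SK_
prefixed macros and statx_needs_sync() are hypothetical names chosen to
avoid clashing with system headers, and their values mirror the
AT_STATX_* definitions in the Linux UAPI header linux/fcntl.h.

```c
#include <stdbool.h>

/* Values mirror the AT_STATX_* flags in <linux/fcntl.h>. */
#define SK_AT_STATX_SYNC_TYPE		0x6000u
#define SK_AT_STATX_SYNC_AS_STAT	0x0000u
#define SK_AT_STATX_FORCE_SYNC		0x2000u
#define SK_AT_STATX_DONT_SYNC		0x4000u

/*
 * Return true when the attributes must be fetched from the server.
 * The fetch is skipped only when DONT_SYNC is set and FORCE_SYNC is not.
 */
static inline bool statx_needs_sync(unsigned int flags)
{
	return (flags & SK_AT_STATX_SYNC_TYPE) != SK_AT_STATX_DONT_SYNC;
}
```

Masking with the sync-type bits before comparing means that a caller
passing both flags still gets a synchronous fetch, which a plain
"flags & DONT_SYNC" test would not guarantee.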


# 68e5ec2e 06-Apr-2022 Xiubo Li <xiubli@redhat.com>

ceph: no need to invalidate the fscache twice

Fixes: 400e1286c0ec3 ("ceph: conversion to new fscache API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7ffe4fce 29-Mar-2022 Xiubo Li <xiubli@redhat.com>

ceph: update the dlease for the hashed dentry when removing

The MDS will always refresh the dentry lease when removing files
or directories. If the dentry is still hashed, we can update the
dentry lease, so there is no need to do the lookup from the MDS later.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bc899ee1 29-Jun-2021 David Howells <dhowells@redhat.com>

netfs: Add a netfs inode context

Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

struct my_inode {
	struct {
		/* These must be contiguous */
		struct inode		vfs_inode;
		struct netfs_i_context	netfs_ctx;
	};
};

The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:

struct netfs_i_context {
	...
	struct fscache_cookie	*cache;
};

Three functions are provided to help with this:

(1) void netfs_i_context_init(struct inode *inode,
const struct netfs_request_ops *ops);

Initialise the netfs context and set the operations.

(2) struct netfs_i_context *netfs_i_context(struct inode *inode);

Find the netfs context from the VFS inode.

(3) struct inode *netfs_inode(struct netfs_i_context *ctx);

Find the VFS inode from the netfs context.

Changes
=======
ver #4)
- Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
cache is present[3].
- Fix netfs_skip_folio_read() to zero out all of the page, not just some
of it[3].

ver #3)
- Split out the bit to move ceph cap-getting on readahead into
ceph_init_request()[1].
- Stick in a comment to the netfs inode structs indicating the contiguity
requirements[2].

ver #2)
- Adjust documentation to match.
- Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
- Move the cap check from ceph_readahead() to ceph_init_request() to be
called from netfslib.
- Remove ceph_readahead() and use netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com

Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4


# 322794d3 01-Mar-2022 Xiubo Li <xiubli@redhat.com>

ceph: fix inode reference leakage in ceph_get_snapdir()

The ceph_get_inode() will search for or insert a new inode into the
hash for the given vino, and return a reference to it. If new is
non-NULL, its reference is consumed.

We should release the reference in the error handling cases.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6ddf5f16 13-Feb-2022 Milind Changire <milindchangire@gmail.com>

ceph: add getvxattr op

Problem:
Some directory vxattrs (e.g. ceph.dir.pin.random) are governed by
information that isn't necessarily shared with the client. Add support
for the new GETVXATTR operation, which allows the client to query the
MDS directly for vxattrs.
When the client is queried for a vxattr that doesn't have a special
handler, have it issue a GETVXATTR to the MDS directly.

Solution:
Adds new getvxattr op to fetch ceph.dir.pin*, ceph.dir.layout* and
ceph.file.layout* vxattrs.
If the entire layout for a dir or a file is being set, then it is
expected that the layout be set in standard JSON format. Individual
field value retrieval is not wrapped in JSON. The JSON format also
applies while setting the vxattr if the entire layout is being set in
one go.
As a temporary measure, setting a vxattr can also be done in the old
format. The old format will be deprecated in the future.

URL: https://tracker.ceph.com/issues/51062
Signed-off-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 810313c5 10-Feb-2022 hongnanli <hongnan.li@linux.alibaba.com>

ceph: fix comments mentioning i_mutex

inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.

Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fd60b288 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

fs: allocate inode by using alloc_inode_sb()

The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 400e1286 07-Dec-2021 Jeff Layton <jlayton@kernel.org>

ceph: conversion to new fscache API

Now that the fscache API has been reworked and simplified, change ceph
over to use it.

With the old API, we would only instantiate a cookie when the file was
open for reads. Change it to instantiate the cookie when the inode is
instantiated and call use/unuse when the file is opened/closed.

Also, ensure we resize the cached data on truncates, and invalidate the
cache in response to the appropriate events. This will allow us to
plumb in write support later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4


# e90334e8 08-Oct-2021 Xiubo Li <xiubli@redhat.com>

ceph: ignore the truncate when size won't change with Fx caps issued

If the new size is the same as the current size, the MDS will do nothing
but change the mtime/atime. POSIX doesn't mandate that the filesystems
must update them in this case, so just ignore it instead.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5d6451b1 31-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: shut down access to inode when async create fails

Add proper error handling for when an async create fails. The inode
never existed, so any dirty caps or data are now toast. We already
d_drop the dentry in that case, but the now-stale inode may still be
around. We want to shut down access to these inodes, and ensure that
they can't harbor any more dirty data, which can cause problems at
umount time.

When this occurs, flag such inodes as being SHUTDOWN, and trash any caps
and cap flushes that may be in flight for them, and invalidate the
pagecache for the inode. Add a new helper that can check whether an
inode or an entire mount is now shut down, and call it instead of
accessing the mount_state directly in places where we test that now.

URL: https://tracker.ceph.com/issues/51279
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6407fbb9 02-Sep-2021 Jeff Layton <jlayton@kernel.org>

ceph: print inode numbers instead of pointer values

We have a lot of log messages that print inode pointer values. This is
of dubious utility. Switch a random assortment of the ones I've found
most useful to use ceph_vinop to print the snap:inum tuple instead.

[ idryomov: use . as a separator, break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1bd85aa6 07-Oct-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix handling of "meta" errors

Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0ba92e1c 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: add ceph_change_snap_realm() helper

Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 23c2c76e 03-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: eliminate ceph_async_iput()

Now that we don't need to hold session->s_mutex or the snap_rwsem when
calling ceph_check_caps, we can eliminate ceph_async_iput and just use
normal iput calls.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 52d60f8e 03-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: eliminate session->s_gen_ttl_lock

Turn s_cap_gen field into an atomic_t, and just rely on the fact that we
hold the s_mutex when changing the s_cap_ttl field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7e65624d 09-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: allow ceph_put_mds_session to take NULL or ERR_PTR

...to simplify some error paths.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 27171ae6 01-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: must hold snap_rwsem when filling inode for async create

...and add a lockdep assertion for it to ceph_fill_inode().

Cc: stable@vger.kernel.org # v5.7+
Fixes: 9a8d03ca2e2c3 ("ceph: attempt to do async create when possible")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d4f6b31d 01-Apr-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't allow access to MDS-private inodes

The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.

Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.

URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2d6795fb 09-Apr-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix up some bare fetches of i_size

We need to use i_size_read(), which properly handles the torn read
case on 32-bit arches.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e7f72952 27-Aug-2020 Yanhu Cao <gmayyyha@gmail.com>

ceph: support getting ceph.dir.rsnaps vxattr

Add support for grabbing the rsnaps value out of the inode info in
traces, and exposing that via ceph.dir.rsnaps xattr.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d3c51ae1 01-Mar-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't clobber i_snap_caps on non-I_NEW inode

We want the snapdir to mirror the non-snapped directory's attributes for
most things, but i_snap_caps represents the caps granted on the snapshot
directory by the MDS itself. A misbehaving MDS could issue different
caps for the snapdir and we lose them here.

Only reset i_snap_caps when the inode is I_NEW. Also, move the setting
of i_op and i_fop inside the if block since they should never change
anyway.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 10a7052c 21-Jan-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix fscache invalidation

Ensure that we invalidate the fscache whenever we invalidate the
pagecache.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ed94f87c 25-Feb-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't allow type or device number to change on non-I_NEW inodes

Al pointed out that a malicious or broken MDS could change the type or
device number of a given inode number. It may also be possible for the
MDS to reuse an old inode number.

Ensure that we never allow fill_inode to change the type part of the
i_mode or the i_rdev unless I_NEW is set. Throw warnings if the MDS ever
changes these on us mid-stream, and return an error.

Don't set i_rdev directly, and rely on init_special_inode to do it.
Also, fix up error handling in the callers of ceph_get_inode.

In handle_cap_grant, check for and warn if the inode type changes, and
only overwrite the mode if it didn't.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3e10a15f 25-Feb-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix up error handling with snapdirs

There are several warts in the snapdir error handling. The -EOPNOTSUPP
return in __snapfh_to_dentry is currently lost, and the call to
ceph_handle_snapdir is not currently checked at all.

Fix all of this up and eliminate a BUG_ON in ceph_get_snapdir. We can
handle that case with a warning and return an error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the existing inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesn't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 0d56a451 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

stat: handle idmapped mounts

The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# e65ce2a5 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

acl: handle idmapped mounts

The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the filesystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems. To handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 2f221d6f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 47291baa 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

namei: make permission helpers idmapped mount aware

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then perform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# a8810cdc 10-Dec-2020 Jeff Layton <jlayton@kernel.org>

ceph: allow queueing cap/snap handling after putting cap references

Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().
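
A minimal sketch of the three-way choice, with hypothetical names (the message only says that the boolean parameter becomes an enum). Counters stand in for actually running or queueing check_caps.

```c
/* Hypothetical names for the three behaviours described above. */
enum sk_put_refs_mode {
	SK_PUT_REFS_SYNC,	/* run the cap check in the caller's context */
	SK_PUT_REFS_ASYNC,	/* queue the cap check for a worker */
	SK_PUT_REFS_NO_CHECK,	/* skip the cap check entirely */
};

struct sk_cap_inode {
	int refs;
	int checks_run;		/* checks done synchronously */
	int checks_queued;	/* checks handed to a worker */
};

static void sk_put_cap_refs(struct sk_cap_inode *in,
			    enum sk_put_refs_mode mode)
{
	in->refs--;

	switch (mode) {
	case SK_PUT_REFS_SYNC:
		in->checks_run++;
		break;
	case SK_PUT_REFS_ASYNC:
		in->checks_queued++;
		break;
	case SK_PUT_REFS_NO_CHECK:
		break;
	}
}
```

A caller that holds a lock it must not hold during the check, as with the mmap_lock in page_mkwrite, would pass the async mode.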

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 64f28c62 09-Oct-2020 Jeff Layton <jlayton@kernel.org>

ceph: clean up inode work queueing

Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.
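
The pattern can be sketched as follows. All names are hypothetical, and a boolean stands in for the real workqueue: one generic function takes a reference, sets the work bit and queues the work, and thin wrappers supply the bit.

```c
#include <stdbool.h>

/* Hypothetical work bits; one per kind of deferred inode work. */
enum { SK_WORK_WRITEBACK, SK_WORK_INVALIDATE, SK_WORK_VMTRUNCATE };

struct sk_work_inode {
	int refcount;
	unsigned long work_mask;
	bool work_queued;	/* stands in for a pending work item */
};

/* Take a reference, record which work is wanted, queue the work once. */
static void sk_queue_inode_work(struct sk_work_inode *in, int work_bit)
{
	in->refcount++;
	in->work_mask |= 1UL << work_bit;
	if (in->work_queued)
		in->refcount--;	/* already queued: drop the extra reference */
	else
		in->work_queued = true;
}

static inline void sk_queue_writeback(struct sk_work_inode *in)
{
	sk_queue_inode_work(in, SK_WORK_WRITEBACK);
}

static inline void sk_queue_vmtruncate(struct sk_work_inode *in)
{
	sk_queue_inode_work(in, SK_WORK_VMTRUNCATE);
}
```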

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0f51a983 09-Dec-2020 Jeff Layton <jlayton@kernel.org>

ceph: don't reach into request header for readdir info

We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bca9fc14 12-Nov-2020 Jeff Layton <jlayton@kernel.org>

ceph: when filling trace, call ceph_get_inode outside of mutexes

Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
while the read is in flight

2) the same inode is evicted from the cache, and this task gets stuck
waiting for the page lock because of the above readahead

3) another task is processing a reply trace, and looks up the inode
being evicted while holding the s_mutex. That ends up waiting for the
eviction to complete

4) a write reply for an unrelated inode is then processed in the
ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to "1" is stuck behind the write reply in "4", so we deadlock
at that point.

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.

[ idryomov: break unnecessarily long lines ]

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 68cbb805 12-Nov-2020 Jeff Layton <jlayton@kernel.org>

ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 04fabb11 09-Nov-2020 Jeff Layton <jlayton@kernel.org>

ceph: ensure we have Fs caps when fetching dir link count

The link count for a directory is defined as inode->i_subdirs + 2,
(for "." and ".."). i_subdirs is only populated when Fs caps are held.
Ensure we grab Fs caps when fetching the link count for a directory.
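
As a small worked example of the rule above (hypothetical helper; the -1 return merely signals that Fs caps must be acquired first):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Link count of a directory: one link per subdirectory plus two for
 * "." and "..". The subdirectory count is only trustworthy while Fs
 * caps are held, so report -1 otherwise.
 */
static int64_t sk_dir_link_count(int64_t subdirs, bool have_fs_caps)
{
	if (!have_fs_caps)
		return -1;
	return subdirs + 2;
}
```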

[ idryomov: break unnecessarily long line ]

URL: https://tracker.ceph.com/issues/48125
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 50c9132d 25-Sep-2020 Jeff Layton <jlayton@kernel.org>

ceph: add new RECOVER mount_state when recovering session

When recovering a session (a la recover_session=clean), we want to do
all of the operations that we do on a forced umount, but changing the
mount state to SHUTDOWN can cause queued MDS requests to fail when
the session comes back. Most of those can idle until the session is
recovered in this situation.

Reserve SHUTDOWN state for forced umount, and make a new RECOVER state
for the forced reconnect situation. Change several tests for equality with
SHUTDOWN to test for that or RECOVER.
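
A rough sketch of the resulting state tests. The enum and helper names are hypothetical, and the split between the two helpers is an interpretation of the text above rather than the actual kernel code.

```c
#include <stdbool.h>

/* Hypothetical mount states; only the last two matter here. */
enum sk_mount_state {
	SK_MOUNT_MOUNTED,
	SK_MOUNT_SHUTDOWN,	/* forced umount */
	SK_MOUNT_RECOVER,	/* forced reconnect, session will return */
};

/* Tests that used to compare against SHUTDOWN alone now accept both. */
static bool sk_mount_is_down(enum sk_mount_state st)
{
	return st == SK_MOUNT_SHUTDOWN || st == SK_MOUNT_RECOVER;
}

/* Queued requests fail only on a real shutdown; in RECOVER they idle. */
static bool sk_fail_queued_requests(enum sk_mount_state st)
{
	return st == SK_MOUNT_SHUTDOWN;
}
```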

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1dd8d470 03-Sep-2020 Xiubo Li <xiubli@redhat.com>

ceph: metrics for opened files, pinned caps and opened inodes

On the client, each inode may have many open files and may be pinned
by more than one MDS. Some inodes are idle and have no open files at
all.

This patch shows these metrics in debugfs, like:

item                               total
-----------------------------------------
opened files  / total inodes       14 / 5
pinned i_caps / total inodes       7  / 5
opened inodes / total inodes       3  / 5

These metrics will later be sent to ceph, where they will be used by
`fs top`.

[ jlayton: drop unrelated hunk, count hashed inodes instead of
allocated ones ]

URL: https://tracker.ceph.com/issues/47005
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2678da88 03-Sep-2020 Xiubo Li <xiubli@redhat.com>

ceph: add ceph_sb_to_mdsc helper support to parse the mdsc

This will help simplify the code.

[ jlayton: fix minor merge conflict in quota.c ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ebce3eb2 18-Aug-2020 Jeff Layton <jlayton@kernel.org>

ceph: fix inode number handling on arches with 32-bit ino_t

Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).

I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).

What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.

Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.
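
One way to hash a 64-bit inode number down to 32 bits is to XOR its two halves. The sketch below assumes that approach and uses hypothetical names; substituting 2 for a zero result is also an assumption, made so that no file is ever presented with inode number 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fold a 64-bit inode number into 32 bits by XORing its halves. */
static uint32_t sk_fold_ino32(uint64_t ino)
{
	uint32_t folded = (uint32_t)(ino & 0xffffffffu) ^
			  (uint32_t)(ino >> 32);

	return folded ? folded : 2;	/* avoid presenting inode 0 */
}

/* Value shown to userland: folded only when mounted with ino32. */
static uint64_t sk_present_ino(uint64_t ino, bool ino32)
{
	return ino32 ? sk_fold_ino32(ino) : ino;
}
```

Because the fold happens only at presentation time, the full 64-bit number can still be used internally for lookups.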

The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.

While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.

NOTE: This is a user-visible change on 32-bit arches:

1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.

2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.

The workaround for both problems is to mount with "-o ino32".

[ idryomov: changelog tweak ]

URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1af16d54 19-Mar-2020 Xiubo Li <xiubli@redhat.com>

ceph: add caps perf metric for each superblock

Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.
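
The hit/miss rule can be sketched as a mask comparison. The struct and function names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct sk_caps_metric {
	uint64_t hit;
	uint64_t mis;
};

/*
 * Count a hit when every needed cap bit is already issued, and a miss
 * in any other situation. Returns true on a hit.
 */
static bool sk_caps_account(struct sk_caps_metric *m, unsigned int issued,
			    unsigned int need)
{
	bool hit = (issued & need) == need;

	if (hit)
		m->hit++;
	else
		m->mis++;
	return hit;
}
```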

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ef915725 11-Mar-2020 Luis Henriques <lhenriques@suse.com>

ceph: fix snapshot directory timestamps

The .snap directory timestamps are kept at 0 (1970-01-01 00:00), which
isn't consistent with what the fuse client does. This patch makes the
behaviour consistent, by setting these timestamps (atime, btime, ctime,
mtime) to those of the parent directory.

Cc: Marc Roos <M.Roos@f1-outsourcing.eu>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bf73c62e 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: check all mds' caps after page writeback

If an inode has caps from multiple mds's, the following can happen:

- non-auth mds revokes Fsc. Fcb is used, so page writeback is queued.
- when writeback finishes, ceph_check_caps() is called with auth only
flag. ceph_check_caps() invalidates pagecache, but skips checking any
non-auth caps.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 135e671e 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: simplify calling of ceph_get_fmode()

Originally, ceph_get_fmode() for open files was called by the thread that
handles the request reply. There is a small window between updating caps
and waking the request initiator. We need to prevent ceph_check_caps()
from releasing wanted caps in that window.

Previous patches made fill_inode() call __ceph_touch_fmode() for open file
requests. This prevented ceph_check_caps() from releasing wanted caps for
'caps_wanted_delay_min' seconds, enough for request initiator to get
woken up and call ceph_get_fmode().

This allows us to now call ceph_get_fmode() in ceph_open() instead.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a0d93e32 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: remove delay check logic from ceph_check_caps()

__ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and
'caps_wanted_delay_max'. There is no need to duplicate the logic in
ceph_check_caps() and __send_cap()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 719a2514 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: consider inode's last read/write when calculating wanted caps

Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
used to track the last time the client acquired read/write caps for
the inode.

If there is no read/write on an inode for 'caps_wanted_delay_max'
seconds, __ceph_caps_file_wanted() does not request caps for read/write
even if there are open files.

Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
calculates dir's wanted caps according to last dir read/modification. If
there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If
there is recent dir modification, also wants CEPH_CAP_FILE_EXCL.

Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
readdir, as with that, modifications do not need to release
CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.
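
The timeout rule for regular files can be sketched as below. The names and the two-bit mask are hypothetical, times are plain seconds, and the directory cases are left out.

```c
#include <stdbool.h>

#define SK_WANT_RD 0x1u
#define SK_WANT_WR 0x2u

struct sk_fmode_state {
	bool open_rd;	/* some file is open for read */
	bool open_wr;	/* some file is open for write */
	long last_rd;	/* time of last read, in seconds */
	long last_wr;	/* time of last write, in seconds */
};

/* Want caps only for modes that are open and were used recently. */
static unsigned int sk_file_wanted(const struct sk_fmode_state *st,
				   long now, long delay_max)
{
	unsigned int wanted = 0;

	if (st->open_rd && now - st->last_rd < delay_max)
		wanted |= SK_WANT_RD;
	if (st->open_wr && now - st->last_wr < delay_max)
		wanted |= SK_WANT_WR;
	return wanted;
}
```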

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 785892fe 02-Jan-2020 Jeff Layton <jlayton@kernel.org>

ceph: cache layout in parent dir on first sync create

If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 966c7160 05-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: make ceph_fill_inode non-static

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f5e17aed 18-Feb-2020 Jeff Layton <jlayton@kernel.org>

ceph: track primary dentry link

Newer versions of the MDS will flag a dentry as "primary". In later
patches, we'll need to consult this info, so track it in di->flags.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3bb48b41 02-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: add flag to designate that a request is asynchronous

...and ensure that such requests are never queued. The MDS has need to
know that a request is asynchronous so add flags and proper
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.
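
The fallback can be sketched like this. Everything here is hypothetical: SK_EJUKEBOX is a local copy of the kernel-internal errno value (it is not in userspace <errno.h>), and a return of 1 stands for "queued to wait for the session".

```c
#include <stdbool.h>

#define SK_EJUKEBOX 528	/* assumption: kernel-internal value, copied locally */

struct sk_session {
	bool open;
};

/* Returns 0 if sent, 1 if queued, -SK_EJUKEBOX if async and not sendable. */
static int sk_submit(const struct sk_session *s, bool async)
{
	if (s->open)
		return 0;
	return async ? -SK_EJUKEBOX : 1;	/* async requests are never queued */
}

/* Try an async request first and reattempt synchronously if refused. */
static int sk_create(const struct sk_session *s, bool *went_async)
{
	int ret = sk_submit(s, true);

	*went_async = (ret == 0);
	if (ret == -SK_EJUKEBOX)
		ret = sk_submit(s, false);
	return ret;
}
```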

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f85122af 02-Apr-2019 Jeff Layton <jlayton@kernel.org>

ceph: add refcounting for Fx caps

In future patches we'll be taking and relying on Fx caps. Add proper
refcounting for them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 893e456b 11-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: don't clear I_NEW until inode metadata is fully populated

Currently, we could have an open-by-handle (or NFS server) call
into the filesystem and start working with an inode before it's
properly filled out.

Don't clear I_NEW until we have filled out the inode, and discard it
properly if that fails. Note that we occasionally take an extra
reference to the inode to ensure that we don't put the last reference in
discard_new_inode, but rather leave it for ceph_async_iput.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 9a6bed4f 05-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: ensure we have a new cap before continuing in fill_inode

If the caller passes in a NULL cap_reservation, and we can't allocate
one then ensure that we fail gracefully.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 57c21994 04-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: drop unused ttl_from parameter from fill_inode

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1f08529c 29-Oct-2019 Al Viro <viro@zeniv.linux.org.uk>

ceph: add missing check in d_revalidate snapdir handling

We should not play with dcache without parent locked...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c62498d7 25-Jul-2019 Jeff Layton <jlayton@kernel.org>

ceph: update the mtime when truncating up

If we have Fx caps, and we're truncating the size to be larger, then
we'll cache the size attribute change, but the mtime won't be updated.

Move the size handling before the mtime, and add ATTR_MTIME to ia_valid
in that case to make sure the mtime also gets updated.
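
A minimal sketch of the adjustment, using hypothetical local flag values in place of the kernel's ATTR_SIZE and ATTR_MTIME:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: local stand-ins for the kernel's ATTR_* flags. */
#define SK_ATTR_SIZE  (1u << 3)
#define SK_ATTR_MTIME (1u << 5)

/*
 * When a size change is cached locally (Fx caps held) and the file grows,
 * also request an mtime update so the truncate is visible in the mtime.
 */
static unsigned int sk_fixup_ia_valid(unsigned int ia_valid, bool have_fx,
				      uint64_t old_size, uint64_t new_size)
{
	if ((ia_valid & SK_ATTR_SIZE) && have_fx && new_size > old_size)
		ia_valid |= SK_ATTR_MTIME;
	return ia_valid;
}
```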

This fixes xfstest generic/313.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f4b97866 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: track and report error of async metadata operation

Use errseq_t to track and report errors of async metadata operations,
similar to how kernel handles errors during writeback.
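
The idea behind errseq-style tracking can be modelled in a few lines. This is a deliberately simplified model with hypothetical names, not the kernel's errseq_t: each recorded error bumps a sequence number, and each observer keeps a cursor so that it sees a given error only once.

```c
#include <errno.h>

struct sk_errseq {
	int err;		/* most recent error, as a negative errno */
	unsigned int seq;	/* bumped each time an error is recorded */
};

static void sk_errseq_set(struct sk_errseq *e, int err)
{
	e->err = err;
	e->seq++;
}

/* Take a cursor, e.g. when a file is opened. */
static unsigned int sk_errseq_sample(const struct sk_errseq *e)
{
	return e->seq;
}

/* Report an error recorded since the cursor, then advance the cursor. */
static int sk_errseq_check_and_advance(const struct sk_errseq *e,
					unsigned int *since)
{
	if (*since == e->seq)
		return 0;
	*since = e->seq;
	return e->err;
}
```

An fsync-like caller would run the check against its own cursor, so the error is reported once to each observer rather than being lost or repeated forever.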

If any dirty caps or any unsafe request gets dropped during session
eviction, record -EIO in corresponding inode's i_meta_err. The error
will be reported by subsequent fsync.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 75067034 23-Jul-2019 Luis Henriques <lhenriques@suse.com>

ceph: fix directories inode i_blkbits initialization

When filling an inode with info from the MDS, i_blkbits is being
initialized using fl_stripe_unit, which contains the stripe unit in
bytes. Unfortunately, this doesn't make sense for directories as they
have fl_stripe_unit set to '0'. This means that i_blkbits will be set
to 0xff, causing an UBSAN undefined behaviour in i_blocksize():

UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12
shift exponent 255 is too large for 32-bit type 'int'

Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit
is zero.
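
The arithmetic behind the bug and the fix can be reproduced in userspace. The names are hypothetical, and SK_BLOCK_SHIFT of 22 (4 MiB) is an assumption standing in for CEPH_BLOCK_SHIFT.

```c
#include <stdint.h>

#define SK_BLOCK_SHIFT 22	/* assumption: stand-in for CEPH_BLOCK_SHIFT */

/* Position of the highest set bit, counting from 1; 0 when x is 0. */
static int sk_fls(uint32_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/*
 * Block-size shift derived from the stripe unit. A zero stripe unit
 * (as on directories) would give fls(0) - 1 = -1, which is 0xff when
 * stored in an 8-bit field, so fall back to the default shift.
 */
static uint8_t sk_blkbits(uint32_t stripe_unit)
{
	if (stripe_unit == 0)
		return SK_BLOCK_SHIFT;
	return (uint8_t)(sk_fls(stripe_unit) - 1);
}
```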

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# af8a85a4 19-Jul-2019 Luis Henriques <lhenriques@suse.com>

ceph: fix buffer free while holding i_ceph_lock in fill_inode()

Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by
postponing the call until later, when the lock is released.
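
The deferral pattern can be sketched as follows. This is a userspace toy with hypothetical names: a boolean stands in for i_ceph_lock, and a counter records any free that happens while it is "held".

```c
#include <stdbool.h>
#include <stdlib.h>

struct sk_xattr_inode {
	bool locked;		/* stands in for i_ceph_lock */
	char *xattr_blob;
	int frees_under_lock;	/* would be a bug: freeing may sleep */
};

static void sk_buffer_put(struct sk_xattr_inode *in, char *buf)
{
	if (in->locked)
		in->frees_under_lock++;
	free(buf);
}

/* Swap in a new blob under the lock; free the old one after unlocking. */
static void sk_replace_blob(struct sk_xattr_inode *in, char *new_blob)
{
	char *old_blob;

	in->locked = true;
	old_blob = in->xattr_blob;	/* stash it, do not free yet */
	in->xattr_blob = new_blob;
	in->locked = false;

	sk_buffer_put(in, old_blob);	/* safe: the lock is released */
}
```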

The following backtrace was triggered by fstests generic/070.

BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
6 locks held by kworker/0:4/3852:
#0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
#1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
#2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
#3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
#4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
#5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
fill_inode.isra.0+0xa9b/0xf70
ceph_fill_trace+0x13b/0xc70
? dispatch+0x2eb/0x1476
dispatch+0x320/0x1476
? __mutex_unlock_slowpath+0x4d/0x2a0
ceph_con_workfn+0xc97/0x2ec0
? process_one_work+0x1b8/0x5f0
process_one_work+0x244/0x5f0
worker_thread+0x4d/0x3e0
kthread+0x105/0x140
? process_one_work+0x5f0/0x5f0
? kthread_park+0x90/0x90
ret_from_fork+0x3a/0x50

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 52dd0f1b 05-Jul-2019 Luis Henriques <lhenriques@suse.com>

ceph: use generic_delete_inode() for ->drop_inode

ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 87bc5b89 01-Jun-2019 Yan, Zheng <zyan@redhat.com>

ceph: use ceph_evict_inode to cleanup inode's resource

remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
freeing inode to remove its caps. But VFS wakes freeing inode waiters
before calling destroy_inode().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40102
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a35ead31 06-Jun-2019 Jeff Layton <jlayton@kernel.org>

ceph: add change_attr field to ceph_inode_info

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 58981784 04-Jun-2019 Jeff Layton <jlayton@kernel.org>

ceph: allow querying of STATX_BTIME in ceph_getattr

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 245ce991 29-May-2019 Jeff Layton <jlayton@kernel.org>

ceph: add btime field to ceph_inode_info

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ac6713cc 26-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: add selinux support

When creating new file/directory, use security_dentry_init_security() to
prepare selinux context for the new inode, then send openc/mkdir request
to MDS, together with selinux xattr.

security_dentry_init_security() only supports single security module and
only selinux has dentry_init_security hook. So only selinux is supported
for now. We can add support for other security modules once kernel has a
generic version of dentry_init_security()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d6e47819 22-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: hold i_ceph_lock when removing caps for freeing inode

ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 543212b3 22-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: close race between d_name_cmp() and update_dentry_lease()

d_name_cmp() and update_dentry_lease() each lock and unlock
dentry->d_lock separately. The dentry may get renamed between them. The
fix is moving the dentry name compare into update_dentry_lease().

This patch introduces two versions of update_dentry_lease(). One version
is for the case where the parent inode is locked. It does not need to
check the parent/target inode and dentry name. The other version is for
the case where the parent inode is not locked. It checks the
parent/target inode and dentry name after locking dentry->d_lock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 193e7b37 18-Apr-2019 David Disseldorp <ddiss@suse.de>

ceph: carry snapshot creation time with inodes

MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3e1d0452 18-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: avoid iput_final() while holding mutex or in dispatch thread

iput_final() may wait for readahead pages. The wait can cause deadlock.
For example:

Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
schedule+0x36/0x80
io_schedule+0x16/0x40
__lock_page+0x101/0x140
truncate_inode_pages_range+0x556/0x9f0
truncate_inode_pages_final+0x4d/0x60
evict+0x182/0x1a0
iput+0x1d2/0x220
iterate_session_caps+0x82/0x230 [ceph]
dispatch+0x678/0xa80 [ceph]
ceph_con_workfn+0x95b/0x1560 [libceph]
process_one_work+0x14d/0x410
worker_thread+0x4b/0x460
kthread+0x105/0x140
ret_from_fork+0x22/0x40

Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
__schedule+0x3d6/0x8b0
schedule+0x36/0x80
schedule_preempt_disabled+0xe/0x10
mutex_lock+0x2f/0x40
ceph_check_caps+0x505/0xa80 [ceph]
ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
writepages_finish+0x2d3/0x410 [ceph]
__complete_request+0x26/0x60 [libceph]
handle_reply+0x6c8/0xa10 [libceph]
dispatch+0x29a/0xbb0 [libceph]
ceph_con_workfn+0x95b/0x1560 [libceph]
process_one_work+0x14d/0x410
worker_thread+0x4b/0x460
kthread+0x105/0x140
ret_from_fork+0x22/0x40

In above example, truncate_inode_pages_range() waits for readahead pages
while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks
OSD dispatch thread. Later OSD replies (for readahead) can't be handled.

ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock
can happen if iput_final() is called while holding snap_rwsem.

In general, it's not good to call iput_final() inside MDS/OSD dispatch
threads or while holding any mutex.

The fix is introducing ceph_async_iput(), which calls iput_final() in
workqueue.
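
One way to implement such a helper is to drop the reference directly unless it is the last one, and hand only the final put to a worker. The sketch below assumes that approach, uses hypothetical names, and replaces the workqueue with a flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sk_async_inode {
	atomic_int count;
	bool final_put_queued;	/* stands in for queueing work */
};

/* Decrement *v unless it equals 'unless'; true if it was decremented. */
static bool sk_dec_unless(atomic_int *v, int unless)
{
	int cur = atomic_load(v);

	while (cur != unless) {
		/* on failure, cur is reloaded and the test is repeated */
		if (atomic_compare_exchange_weak(v, &cur, cur - 1))
			return true;
	}
	return false;
}

/* Drop a reference; defer the last one so eviction runs in a worker. */
static void sk_async_iput(struct sk_async_inode *in)
{
	if (sk_dec_unless(&in->count, 1))
		return;
	in->final_put_queued = true;	/* the worker would do the real iput */
}
```

The caller never performs the final put itself, so it cannot end up waiting on page locks while holding a mutex or running in a dispatch thread.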

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1cf89a8d 17-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: single workqueue for inode related works

We have three workqueues for inode work. A later patch will introduce
one more work item for inodes. It's not good to introduce more workqueues
and add more 'struct work_struct' members to 'struct ceph_inode_info'.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 428bb68a 11-Apr-2019 Jeff Layton <jlayton@kernel.org>

ceph: properly handle granular statx requests

cephfs can benefit from statx. We can have the client just request caps
sufficient for the needed attributes and leave off the rest.

Also, recognize when AT_STATX_DONT_SYNC is set, and just scrape the
inode without doing any call in that case. Force a call to the MDS in
the event that AT_STATX_FORCE_SYNC is set.
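
The flag handling described above can be sketched as a small decision function. The SK_* flag values are local copies of the Linux AT_STATX_* constants, and the enum and function names are hypothetical.

```c
/* Assumption: local copies of the Linux AT_STATX_* flag values. */
#define SK_AT_STATX_FORCE_SYNC 0x2000u
#define SK_AT_STATX_DONT_SYNC  0x4000u

enum sk_getattr_action {
	SK_USE_CACHED,		/* scrape the inode, no MDS call */
	SK_ASK_IF_CAPS_MISSING,	/* call the MDS only if needed caps are absent */
	SK_ALWAYS_ASK_MDS,	/* force a round trip */
};

static enum sk_getattr_action sk_getattr_action(unsigned int flags)
{
	if (flags & SK_AT_STATX_DONT_SYNC)
		return SK_USE_CACHED;
	if (flags & SK_AT_STATX_FORCE_SYNC)
		return SK_ALWAYS_ASK_MDS;
	return SK_ASK_IF_CAPS_MISSING;
}
```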

Link: http://tracker.ceph.com/issues/39258
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cfa6d412 10-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

ceph: use ->free_inode()

a lot of non-delayed work in this case; all of that is left in
->destroy_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4b822287 14-Apr-2019 Jeff Layton <jlayton@kernel.org>

ceph: handle the case where a dentry has been renamed on outstanding req

It's possible for us to issue a lookup to revalidate a dentry
concurrently with a rename. If done in the right order, then we could
end up processing dentry info in the reply that no longer reflects the
state of the dentry.

If req->r_dentry->d_name differs from the one in the trace, then just
ignore the trace in the reply. We only need to do this however if the
parent's i_rwsem is not held.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# daf5cc27 25-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

ceph: fix use-after-free on symlink traversal

free the symlink body after the same RCU delay we have for freeing the
struct inode itself, so that traversal during RCU pathwalk wouldn't step
into freed memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 37c4efc1 31-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: periodically trim stale dentries

The previous commit makes the VFS delete a stale dentry when its last
reference is dropped. A lease can also become invalid while the
corresponding dentry has no references. This patch makes cephfs
periodically scan the lease lists and delete the corresponding dentry if
its lease is invalid.

There are two types of lease: dentry lease and dir lease. A dentry lease
has a lifetime and applies to a single dentry. A dentry lease is added to
the tail of a list when it's updated, so leases at the front of the list
expire first. A dir lease is CEPH_CAP_FILE_SHARED on the directory inode; it
applies to all dentries in the directory. Dentries that have dir leases are
added to another list. Dentries in that list are periodically checked in a
round robin manner.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1e9c2eb6 28-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: delete stale dentry when last reference is dropped

introduce ceph_d_delete(), which checks if dentry has valid lease.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e3ec8d68 14-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: send cap releases more aggressively

When pending cap releases fill up one message, start a work to send
cap release message. (old way is sending cap releases every 5 seconds)

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 08796873 08-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: support getting ceph.dir.pin vxattr

Link: http://tracker.ceph.com/issues/37576
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 75c9627e 14-Dec-2017 Yan, Zheng <zyan@redhat.com>

ceph: map snapid to anonymous bdev ID

ceph_getattr() returns a zero dev ID for head inodes and sets the dev ID to
the snapid directly for snapshot inodes. This is not good because userspace
utilities may consider a device ID of 0 invalid, and a snapid may conflict
with another device's ID.

This patch introduces a "snapids to anonymous bdev IDs" map. We create a
new mapping when we see a snapid for the first time. We trim an unused
mapping after it has been idle for 5 minutes.

Link: http://tracker.ceph.com/issues/22353
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5ba72e60 17-Dec-2018 Yan, Zheng <zyan@redhat.com>

ceph: set special inode's blocksize to page size

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2bf996ac 25-Oct-2018 Yan, Zheng <zyan@redhat.com>

ceph: cleanup splice_dentry()

splice_dentry() may drop the original dentry and return another
dentry. It relies on its caller to update the pointer that points
to the dropped dentry. This is error-prone.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 74c9e6bf 27-Sep-2018 Yan, Zheng <zyan@redhat.com>

ceph: check if LOOKUPNAME request was aborted when filling trace

d_lookup()/d_alloc() require parent inode locked. Parent inode is
not locked if request is aborted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c58f450b 27-Sep-2018 Yan, Zheng <zyan@redhat.com>

ceph: fix dentry leak in ceph_readdir_prepopulate

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# efe32823 27-Sep-2018 Yan, Zheng <zyan@redhat.com>

Revert "ceph: fix dentry leak in splice_dentry()"

This reverts commit 8b8f53af1ed9df88a4c0fbfdf3db58f62060edf3.

splice_dentry() is used in three places. In two of them, req->r_dentry
is passed to splice_dentry(). In the case of error, req->r_dentry does
not get updated, so splice_dentry() should not drop the reference.

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 36a4c72d 05-Aug-2018 Chengguang Xu <cgxu519@gmx.com>

ceph: add additional size check in ceph_setattr()

ceph_setattr() finally calls the VFS function inode_newsize_ok()
to do offset validation, and that is based on sb->s_maxbytes.
Because we set sb->s_maxbytes to MAX_LFS_FILESIZE to get through the
VFS check and do proper offset validation at the cephfs level,
we need to add proper offset validation before calling
inode_newsize_ok().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
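
The extra check described above can be pictured with a small userspace sketch. The function name and its parameters are hypothetical and are not taken from the kernel source; it only assumes that the filesystem keeps its own maximum file size and rejects a larger requested size before the generic VFS check runs.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a filesystem-level size check.
 * fs_max_file_size models the limit the filesystem enforces itself,
 * because the VFS-level limit was deliberately set very high.
 */
int check_new_size(uint64_t new_size, uint64_t fs_max_file_size)
{
	if (new_size > fs_max_file_size)
		return -EFBIG;	/* too large for this filesystem */
	return 0;		/* let the generic VFS check run next */
}
```

A size equal to the limit is accepted in this sketch; only strictly larger sizes are refused.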


# 0ed1e90a 13-Jul-2018 Arnd Bergmann <arnd@arndb.de>

ceph: use timespec64 for r_stamp

The ceph_mds_request stamp still uses the deprecated timespec structure,
this converts it over as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 9bbeab41 13-Jul-2018 Arnd Bergmann <arnd@arndb.de>

ceph: use timespec64 for inode timestamp

Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.

In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
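
The padding concern in the second paragraph can be illustrated with a minimal userspace sketch. The struct and helper names below (aux_key, aux_key_fill, aux_key_equal) are hypothetical and are not the kernel's definitions; the point is only that copying each timestamp member into a fixed-width integer gives memcmp() a fully initialized object to compare.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical comparison key: three 64-bit members, so no padding. */
struct aux_key {
	uint64_t version;
	uint64_t mtime_sec;
	uint64_t mtime_nsec;
};

/* Copy the timestamp member by member instead of embedding the struct. */
void aux_key_fill(struct aux_key *key, uint64_t version,
		  const struct timespec *mtime)
{
	memset(key, 0, sizeof(*key));
	key->version = version;
	key->mtime_sec = (uint64_t)mtime->tv_sec;
	key->mtime_nsec = (uint64_t)mtime->tv_nsec;
}

/* Safe to memcmp(): every byte of both keys has been initialized. */
int aux_key_equal(const struct aux_key *a, const struct aux_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two keys filled from the same version and timestamp compare equal regardless of how struct timespec is laid out on the platform.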


# 8b8f53af 19-Jun-2018 Yan, Zheng <zyan@redhat.com>

ceph: fix dentry leak in splice_dentry()

In any case, d_splice_alias() does not drop reference of original
dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 95582b00 08-May-2018 Deepa Dinamani <deepa.kernel@gmail.com>

vfs: change inode times to use struct timespec64

struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}

@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}

@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>


# aae1a442 26-May-2018 Yan, Zheng <zyan@redhat.com>

ceph: prevent i_version from going back

inode info from non-auth can be stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8c6286f1 21-May-2018 Luis Henriques <lhenriques@suse.com>

ceph: fix st_nlink stat for directories

Currently, calling stat on a cephfs directory returns 1 for st_nlink.
This behaviour has recently changed in the fuse client, as some
applications seem to expect this value to be either 0 (if it's
unlinked) or 2 + number of subdirectories. This behaviour was changed
in the fuse client with commit 67c7e4619188 ("client: use common
interp of st_nlink for dirs").

This patch modifies the kernel client to have a similar behaviour.

Link: https://tracker.ceph.com/issues/23873
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
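
The convention described above (0 for an unlinked directory, otherwise 2 plus the number of subdirectories) can be written as a short sketch. The helper name and its parameters are hypothetical; this is an illustration of the rule, not the code the patch adds.

```c
/*
 * Hypothetical helper: link count to report for a directory.
 * raw_nlink is the link count stored for the inode and nsubdirs is
 * the number of subdirectories it contains.
 */
unsigned int dir_nlink(unsigned int raw_nlink, unsigned int nsubdirs)
{
	if (raw_nlink == 0)
		return 0;	/* directory has been unlinked */

	/* entry in the parent, ".", and one ".." per subdirectory */
	return 2 + nsubdirs;
}
```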


# 2af54a72 26-Apr-2018 Yan, Zheng <zyan@redhat.com>

ceph: update i_files/i_subdirs only when Fs cap is issued

In MDS, file/subdir counts of a directory inode are protected by
filelock. In request reply without Fs cap, nfiles/nsubdirs can be
stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 49a9f4f6 25-Apr-2018 Yan, Zheng <zyan@redhat.com>

ceph: always get rstat from auth mds

rstat is not tracked by capability. client can't know if rstat from
non-auth mds is uptodate or not.

Link: http://tracker.ceph.com/issues/23538
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 13442b03 23-Apr-2018 Deepa Dinamani <deepa.kernel@gmail.com>

ceph: make inode time prints to be long long

Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.

Convert these print formats to use long long types to
avoid warnings and errors on conversion.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: zyan@redhat.com
Cc: ceph-devel@vger.kernel.org


# ffdeec7a 26-Mar-2018 Yan, Zheng <zyan@redhat.com>

ceph: always update atime/mtime/ctime for new inode

For new inode, atime/mtime/ctime are uninitialized. Don't compare
against them.

Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d557c48d 12-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: add counter for snaprealms with quota

Keeping a counter with the number of snaprealms that have quota set
allows optimizing the functions that need to walk through the realms
hierarchy looking for quotas. Thus, if this counter is zero it's safe to
assume that there are no realms with quota.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e3161f17 12-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: cache inode pointer in ceph_snap_realm

Keep a pointer to the inode in struct ceph_snap_realm. This allows
optimizing functions that walk the realms hierarchy (e.g. in quotas).

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2b83845f 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: support for ceph.quota.max_bytes

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fb18a575 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: add initial infrastructure to support cephfs quotas

This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client. Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
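
From userspace, setting such a quota amounts to a setxattr(2) call on the directory. The sketch below is a minimal illustration: the helper names are hypothetical, and it assumes the attribute value is passed as a decimal string, as it would be when using a tool such as setfattr.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Format a limit as a decimal string; returns its length or -1. */
int quota_format(char *buf, size_t len, unsigned long long limit)
{
	int n = snprintf(buf, len, "%llu", limit);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Set a byte quota on a directory; a limit of 0 removes the quota. */
int quota_set_max_bytes(const char *dir, unsigned long long limit)
{
	char value[32];
	int n = quota_format(value, sizeof(value), limit);

	if (n < 0)
		return -1;
	return setxattr(dir, "ceph.quota.max_bytes", value, (size_t)n, 0);
}
```

On a filesystem that does not recognize the attribute, setxattr() is expected to fail and set errno, so callers should check the return value.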


# af9cc401 04-Mar-2018 Yan, Zheng <zyan@redhat.com>

ceph: invalidate pages that beyond EOF in ceph_writepages_start()

Dirty pages can be associated with different capsnap. Different capsnap
may have different EOF value. So invalidating dirty pages according to
the largest EOF value is wrong. Dirty pages beyond EOF, but associated
with other capsnap, do not get invalidated.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5495c2d0 26-Nov-2017 Yan, Zheng <zyan@redhat.com>

ceph: avoid dereferencing invalid pointer during cached readdir

Readdir cache keeps array of dentry pointers in page cache. If any
dentry in readdir cache gets pruned, ceph_d_prune() disables readdir
cache for later readdir syscall. The problem is that ceph_d_prune()
ignores unhashed dentry. Ideally MDS should have already revoked
CEPH_CAP_FILE_SHARED (which also disables readdir cache) when dentry
gets unhashed. But if the MDS somehow does not properly revoke
CEPH_CAP_FILE_SHARED and the unhashed dentry gets pruned later,
ceph_d_prune() will not disable readdir cache, and a later readdir may
reference an invalid dentry pointer.

The fix is to make ceph_d_prune() do an extra check for unhashed dentries:
disable readdir cache if the unhashed dentry is still referenced
by readdir cache.

Another fix in this patch is to handle d_splice_alias(). If a dentry
gets spliced into a new parent dentry, treat it as if it was pruned
(call ceph_d_prune() for it).

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 97aeb6bf 26-Nov-2017 Yan, Zheng <zyan@redhat.com>

ceph: use atomic_t for ceph_inode_info::i_shared_gen

It allows accessing i_shared_gen without holding i_ceph_lock. It is
preparation for later patch.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# be70489e 23-Nov-2017 Yan, Zheng <zyan@redhat.com>

ceph: properly drop caps for setattr request

For CEPH_SETATTR_ATIME, MDS needs to xlock filelock, Fsxrw caps
are not allowed for xlocked filelock.

For CEPH_SETATTR_SIZE request that truncates file to smaller size,
MDS needs to xlock filelock, Fsxrw caps are not allowed for xlocked
filelock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 81302565 30-Oct-2017 Jeff Layton <jlayton@kernel.org>

ceph: remove the bump of i_version

Eventually, we'll want to wire cephfs up to use the change attribute
that the cluster tracks instead, but for now this is unneeded.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 933ad2c9 10-Oct-2017 Yan, Zheng <zyan@redhat.com>

ceph: disable cached readdir after dropping positive dentry

Ideally CEPH_CAP_FILE_SHARED should have been revoked before a
positive dentry gets dropped. But if something goes wrong, a later
cached readdir may dereference the dropped dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 89aa5930 08-Sep-2017 Yan, Zheng <zyan@redhat.com>

ceph: keep auth cap when inode has flocks or posix locks

file locks are tracked by inode's auth mds. dropping auth caps
is equivalent to releasing all file locks.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier                             # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# c8fd0d37 28-Aug-2017 Yan, Zheng <zyan@redhat.com>

ceph: handle race between vmtruncate and queuing cap snap

It's possible that we create a cap snap while there is pending
vmtruncate (truncate hasn't been processed by worker thread).
We should truncate dirty pages beyond capsnap->size in that case.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d37b1d99 20-Aug-2017 Markus Elfring <elfring@users.sourceforge.net>

ceph: adjust 36 checks for NULL pointers

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written ...

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b529d1b3 20-Aug-2017 Markus Elfring <elfring@users.sourceforge.net>

ceph: delete an unnecessary return statement in update_dentry_lease()

The script "checkpatch.pl" pointed information out like the following.

WARNING: void function return statements are not generally useful

Thus remove such a statement in the affected function.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 51308806 20-Aug-2017 Markus Elfring <elfring@users.sourceforge.net>

ceph: ENOMEM pr_err in __get_or_create_frag() is redundant

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8d45b911 25-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: don't fill readdir cache for LSSNAP reply

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 9a86962b 25-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: cleanup ceph_readdir_prepopulate()

In LSSNAP case, req->r_dentry is already set to snapdir dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 481f001f 02-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: update ceph_dentry_info::lease_session when necessary

The current code does not update ceph_dentry_info::lease_session once
it is set. If the auth MDS of the corresponding dentry changes, the
dentry lease stays in an invalid state.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# efb0ca76 21-May-2017 Yan, Zheng <zyan@redhat.com>

ceph: update the 'approaching max_size' code

The old 'approaching max_size' code expects the MDS to set max_size to
'2 * reported_size'. This is no longer true. The new code reports the
file size when half of the previous max_size increment has been used.
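
The "half of the previous increment" rule can be sketched in user space as
follows. This is a minimal illustration, not the kernel code: the struct
and the function name are hypothetical, and the increment is taken to be
max_size minus the last reported size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode state; names are illustrative. */
struct size_state {
	uint64_t size;          /* current file size */
	uint64_t reported_size; /* size last reported to the MDS */
	uint64_t max_size;      /* max_size last granted by the MDS */
};

/* True when the client should report its size to get a larger max_size. */
bool should_report_size(const struct size_state *s)
{
	uint64_t increment;

	assert(s != NULL);

	/* Already at (or past) the limit: always report. */
	if (s->size >= s->max_size)
		return true;

	/* No increment was granted beyond the reported size. */
	if (s->max_size <= s->reported_size)
		return false;

	/* Report once half of the granted increment has been used. */
	increment = s->max_size - s->reported_size;
	return s->size >= s->reported_size + increment / 2;
}
```

With reported_size = 100 and max_size = 200, the sketch starts reporting
at size 150 instead of waiting until the limit is reached.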

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 4ca2fea6 01-Jun-2017 Yan, Zheng <zyan@redhat.com>

ceph: unify inode i_ctime update

Current __ceph_setattr() can set inode's i_ctime to current_time(),
req->r_stamp or attr->ia_ctime. These timestamps may differ slightly,
which can cause problems.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 79162547 04-Apr-2017 Yan, Zheng <zyan@redhat.com>

ceph: make seeky readdir more efficient

The current cephfs client uses a string to indicate the start position
of readdir. The string is the last entry of the previous readdir reply.
This approach does not work for seeky readdir because we cannot
easily convert the new position to a string. For seeky readdir, the
MDS needs to return dentries from the beginning. The client keeps
retrying if the reply does not contain the dentry it wants.

In the current version of ceph, the MDS sorts CDentry in its cache in
hash order. The client also uses the dentry hash to compose the dir
position. For seeky readdir, if the client passes the hash part of the
dir position to the MDS, the MDS can avoid replying with useless dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8179a101 18-Apr-2017 Yan, Zheng <zyan@redhat.com>

ceph: fix recursion between ceph_set_acl() and __ceph_setattr()

ceph_set_acl() calls __ceph_setattr() if the setacl operation needs
to modify inode's i_mode. __ceph_setattr() updates inode's i_mode,
then calls posix_acl_chmod().

The problem is that __ceph_setattr() calls posix_acl_chmod() before
sending the setattr request. The get_acl() call in posix_acl_chmod()
can trigger a getxattr request. The reply of the getxattr request
can restore inode's i_mode to its old value. The set_acl() call in
posix_acl_chmod() sees old value of inode's i_mode, so it calls
__ceph_setattr() again.

Cc: stable@vger.kernel.org # needs backporting for < 4.9
Link: http://tracker.ceph.com/issues/19688
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were too ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabel might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in
size.
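
A minimal user-space sketch of such a call follows. It assumes a C library
that provides a statx() wrapper and the STATX_*/AT_STATX_* constants
(glibc 2.28 or later does; no wrapper existed when this commit was
written). The helper name size_via_statx() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>     /* AT_FDCWD, AT_SYMLINK_NOFOLLOW, AT_STATX_SYNC_AS_STAT */
#include <stddef.h>
#include <sys/stat.h>  /* statx(), struct statx, STATX_* (libc wrapper assumed) */

/*
 * Hypothetical helper: fetch the size of `path` without following a
 * trailing symlink (the lstat()-like case described above).
 * Returns 0 on success, -1 if statx() fails or stx_size was not provided.
 */
int size_via_statx(const char *path, unsigned long long *size)
{
	struct statx stx;

	assert(path != NULL && size != NULL);

	if (statx(AT_FDCWD, path,
		  AT_SYMLINK_NOFOLLOW | AT_STATX_SYNC_AS_STAT,
		  STATX_BASIC_STATS | STATX_BTIME, &stx) != 0)
		return -1;

	/* Only trust a field whose bit came back set in stx_mask. */
	if (!(stx.stx_mask & STATX_SIZE))
		return -1;

	*size = stx.stx_size;
	return 0;
}
```

Passing AT_STATX_DONT_SYNC instead of AT_STATX_SYNC_AS_STAT would ask a
network filesystem for its cached, possibly approximate values.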

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

        STATX_TYPE          Want/got stx_mode & S_IFMT
        STATX_MODE          Want/got stx_mode & ~S_IFMT
        STATX_NLINK         Want/got stx_nlink
        STATX_UID           Want/got stx_uid
        STATX_GID           Want/got stx_gid
        STATX_ATIME         Want/got stx_atime{,_ns}
        STATX_MTIME         Want/got stx_mtime{,_ns}
        STATX_CTIME         Want/got stx_ctime{,_ns}
        STATX_INO           Want/got stx_ino
        STATX_SIZE          Want/got stx_size
        STATX_BLOCKS        Want/got stx_blocks
        STATX_BASIC_STATS   [The stuff in the normal stat struct]
        STATX_BTIME         Want/got stx_btime{,_ns}
        STATX_ALL           [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

        STATX_ATTR_COMPRESSED   File is compressed by the fs
        STATX_ATTR_IMMUTABLE    File is marked immutable
        STATX_ATTR_APPEND       File is append-only
        STATX_ATTR_NODUMP       File is not to be dumped
        STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 55f2a045 13-Feb-2017 Ilya Dryomov <idryomov@gmail.com>

ceph: remove special ack vs commit behavior

- ask for a commit reply instead of an ack reply in
__ceph_pool_perm_get()
- don't ask for both ack and commit replies in ceph_sync_write()
- since only one reply is requested now, the i_unsafe_writes list
will always be empty -- kill ceph_sync_write_wait() and go back to
a standard ->evict_inode()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# cdde7c43 27-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: call update_dentry_lease even when r_locked dir is not set

We don't really require that the parent be locked in order to update the
lease on a dentry. Lease info is protected by the d_lock. In the event
that the parent is not locked in ceph_fill_trace, and we have both
parent and target info, go ahead and update the dentry lease.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f5d55f03 27-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: vet the target and parent inodes before updating dentry lease

In a later patch, we're going to need to allow ceph_fill_trace to
update the dentry's lease when the parent is not locked. This is
potentially racy though -- by the time we get around to processing the
trace, the parent may have already changed.

Change update_dentry_lease to take a ceph_vino pointer and use that to
ensure that the dentry's parent still matches it before updating the
lease.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 80d025ff 26-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: don't update_dentry_lease unless we actually got one

This if block updates the dentry lease even in the case where
the MDS didn't grant one.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3dd69aab 31-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: add a new flag to indicate whether parent is locked

struct ceph_mds_request has an r_locked_dir pointer, which is set to
indicate the parent inode and that its i_rwsem is locked. In some
critical places, we need to be able to indicate the parent inode to the
request handling code, even when its i_rwsem may not be locked.

Most of the code that operates on r_locked_dir doesn't require that the
i_rwsem be locked. We only really need it to handle manipulation of the
dcache. The rest (filling of the inode, updating dentry leases, etc.)
already has its own locking.

Add a new r_req_flags bit that indicates whether the parent is locked
when doing the request, and rename the pointer to "r_parent". For now,
all the places that set r_parent also set this flag, but that will
change in a later patch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bc2de10d 01-Feb-2017 Jeff Layton <jlayton@kernel.org>

ceph: convert bools in ceph_mds_request to a new r_req_flags field

Currently, we have a bunch of bool flags in struct ceph_mds_request. We
need more flags though, but each bool takes (at least) a byte. Those
add up over time.

Merge all of the existing bools in this struct into a single unsigned
long, and use the set/test/clear_bit macros to manipulate them. These
are atomic operations, but that is required here to prevent
load/modify/store races. The existing flags are protected by different
locks, so we can't rely on them for that purpose.
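
The pattern can be sketched in user space with C11 atomics standing in for
the kernel's set_bit/test_bit/clear_bit macros. The struct, the flag names
and the helper names below are all hypothetical; only the idea (one word of
flags, updated with atomic read-modify-write operations) comes from the
description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel's own flag names differ. */
enum { REQ_ABORTED = 0, REQ_GOT_UNSAFE = 1, REQ_GOT_SAFE = 2 };

struct request {
	atomic_ulong flags;	/* one word replaces several bool members */
};

/* Turn a bit number into a mask, checking that it fits in the word. */
static unsigned long req_bit(int nr)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT));
	return 1UL << nr;
}

/* Atomic OR: cannot lose a concurrent update to another bit. */
void req_set(struct request *r, int nr)
{
	atomic_fetch_or(&r->flags, req_bit(nr));
}

/* Atomic AND with the inverted mask clears just this bit. */
void req_clear(struct request *r, int nr)
{
	atomic_fetch_and(&r->flags, ~req_bit(nr));
}

bool req_test(struct request *r, int nr)
{
	return (atomic_load(&r->flags) & req_bit(nr)) != 0;
}
```

A plain "flags |= mask" would be a separate load, modify and store, which
is exactly the race the atomic operations avoid when different bits are
protected by different locks.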

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f5a03b08 31-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: drop session argument to ceph_fill_trace

Just get it from r_session since that's what's always passed in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6fffaef9 31-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: remove "Debugging hook" from ceph_fill_trace

Keeping around commented out code is just asking for it to bitrot and
makes viewing the code under cscope more confusing. If
we really need this, then we can revert this patch and put it under a
Kconfig option.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 52953d55 26-Dec-2016 Seraphime Kirkovski <kirkseraph@gmail.com>

ceph: cleanup ACCESS_ONCE -> READ_ONCE

This removes the uses of ACCESS_ONCE in favor of READ_ONCE

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>


# fe2ed425 12-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: fix endianness bug in frag_tree_split_cmp

sparse says:

fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types)
fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a
fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag
fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types)
fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b
fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag

We need to convert these values to host-endian before calling the
comparator.
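
A user-space sketch of the fix pattern follows: decode the little-endian
wire values to host order inside the sort comparator, then compare. The
record layout and all names are hypothetical, and frag_order() is a plain
numeric stand-in, not ceph_frag_compare().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wire record: both fields are little-endian on the wire. */
struct split_rec {
	uint8_t frag[4];
	uint8_t by[4];
};

/* Decode a little-endian 32-bit value regardless of host byte order. */
uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stand-in ordering (plain numeric); it is NOT ceph_frag_compare(). */
int frag_order(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/* qsort() comparator: convert to host order first, then compare. */
int split_cmp(const void *l, const void *r)
{
	const struct split_rec *ls = l;
	const struct split_rec *rs = r;

	return frag_order(get_le32(ls->frag), get_le32(rs->frag));
}

void sort_splits(struct split_rec *v, size_t n)
{
	assert(v != NULL || n == 0);
	qsort(v, n, sizeof(*v), split_cmp);
}
```

Comparing the raw wire values would give the same result on a
little-endian host but a different order on a big-endian one, which is why
sparse flags the missing conversion.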

Fixes: a407846ef7c6 ("ceph: don't assume frag tree splits in mds reply are sorted")
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dfeef688 09-Dec-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove ".readlink = generic_readlink" assignments

.readlink == NULL implies generic_readlink().

Generated by:

to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ad5cb123 28-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph: switch to use of ->d_init()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18fc8abd 28-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph: unify dentry_operations instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f72f9455 12-Oct-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix readdir vs fragmentation race

The following sequence of events triggers the race:

- client readdir frag 0* -> got item 'A'
- MDS merges frag 0* and frag 1*
- client send readdir request (frag 1*, offset 2, readdir_start 'A')
- MDS reply items (that are after item 'A') in frag *

Link: http://tracker.ceph.com/issues/17286
Signed-off-by: Yan, Zheng <zyan@redhat.com>


# fd50ecad 29-Sep-2016 Andreas Gruenbacher <agruenba@redhat.com>

vfs: Remove {get,set,remove}xattr inode operations

These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c2050a45 14-Sep-2016 Deepa Dinamani <deepa.kernel@gmail.com>

fs: Replace current_fs_time() with current_time()

current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 31051c85 26-May-2016 Jan Kara <jack@suse.cz>

fs: Give dentry to inode_change_ok() instead of inode

inode_change_ok() will be responsible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better reflect that it also does some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# fd5472ed 26-May-2016 Jan Kara <jack@suse.cz>

ceph: Propagate dentry down to inode_change_ok()

To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. ceph_setattr() has the dentry easily available but
__ceph_setattr() is also called from ceph_set_acl() where dentry is not
easily available. Luckily that call path does not need inode_change_ok()
to be called anyway. So reorganize functions a bit so that
inode_change_ok() is called only from paths where dentry is available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# e4500b5e 05-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: use list instead of rbtree to track cap flushes

We have no requirement to search cap flushes by TID. In most cases,
we just need to know the TID of the oldest cap flush. A list is ideal
for this usage.
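
The idea can be sketched with a simple FIFO list. This is a simplified
illustration with hypothetical names; it assumes TIDs are handed out in
increasing order and that 0 means "no flush pending", so the oldest TID is
always at the head and is found without any search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intrusive list node for one pending cap flush. */
struct cap_flush {
	uint64_t tid;
	struct cap_flush *next;
};

struct flush_list {
	struct cap_flush *head;	/* oldest pending flush */
	struct cap_flush *tail;	/* newest pending flush */
};

/* Append a new flush; TIDs are assumed to be strictly increasing. */
void flush_add(struct flush_list *l, struct cap_flush *cf)
{
	assert(l->tail == NULL || l->tail->tid < cf->tid);
	cf->next = NULL;
	if (l->tail)
		l->tail->next = cf;
	else
		l->head = cf;
	l->tail = cf;
}

/* O(1): the oldest TID is at the head; 0 means no flush is pending. */
uint64_t oldest_flush_tid(const struct flush_list *l)
{
	return l->head ? l->head->tid : 0;
}

/* Remove and return the oldest flush, or NULL if the list is empty. */
struct cap_flush *flush_pop_oldest(struct flush_list *l)
{
	struct cap_flush *cf = l->head;

	if (cf) {
		l->head = cf->next;
		if (!l->head)
			l->tail = NULL;
		cf->next = NULL;
	}
	return cf;
}
```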

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 9b16f03c 22-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

ceph: don't use ->d_time

Pretty simple: just use ceph_dentry_info.time instead (which was already
there, unused).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 9a5530c6 15-Jun-2016 Yan, Zheng <zyan@redhat.com>

ceph: wait unsafe sync writes for evicting inode

Otherwise ceph_sync_write_unsafe() may access/modify freed inode.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 774a6a11 06-Jun-2016 Yan, Zheng <zyan@redhat.com>

ceph: reduce i_nr_by_mode array size

Track usage count for individual fmode bit. This can reduce the
array size by half.
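
A sketch of per-bit counting follows. The mode bits and names are
hypothetical, not the kernel's definitions. Counting per bit needs one
counter per mode bit rather than one per combination of bits, and the set
of modes in use is the OR of all bits with a non-zero count.

```c
#include <assert.h>

/* Hypothetical file-mode bits; not the kernel's definitions. */
enum { MODE_RD = 1, MODE_WR = 2, MODE_LAZY = 4 };
#define MODE_BITS 3

struct open_counts {
	int nr_by_bit[MODE_BITS];	/* one counter per mode bit */
};

/* Take a reference on every bit set in fmode. */
void mode_get(struct open_counts *c, int fmode)
{
	for (int i = 0; i < MODE_BITS; i++)
		if (fmode & (1 << i))
			c->nr_by_bit[i]++;
}

/* Drop a reference on every bit set in fmode. */
void mode_put(struct open_counts *c, int fmode)
{
	for (int i = 0; i < MODE_BITS; i++) {
		if (fmode & (1 << i)) {
			assert(c->nr_by_bit[i] > 0);
			c->nr_by_bit[i]--;
		}
	}
}

/* OR together all bits that still have at least one user. */
int modes_in_use(const struct open_counts *c)
{
	int used = 0;

	for (int i = 0; i < MODE_BITS; i++)
		if (c->nr_by_bit[i] > 0)
			used |= 1 << i;
	return used;
}
```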

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 779fe0fb 06-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: rados pool namespace support

This patch adds code that decodes pool namespace information in the
cap message and request reply. The pool namespace is saved in i_layout
and will be passed to libceph when doing read/write.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 30c156d9 13-Feb-2016 Yan, Zheng <zyan@redhat.com>

libceph: rados pool namespace support

Add a pool namespace pointer to struct ceph_file_layout and struct
ceph_object_locator. The pool namespace is used when mapping an object
to a PG; it's also used when composing an OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7627151e 03-Feb-2016 Yan, Zheng <zyan@redhat.com>

libceph: define new ceph_file_layout structure

Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8387ff25 10-Jun-2016 Linus Torvalds <torvalds@linux-foundation.org>

vfs: make the string hashes salt the hash

We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time. It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.

A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.

Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.

Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9abd4db7 18-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: don't use truncate_pagecache() to invalidate read cache

truncate_pagecache() drops dirty pages, so it is dangerous to use it
to invalidate the read cache. Besides, we shouldn't start invalidating
the read cache while there are buffer writers, because buffer writers
may add dirty pages later.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 224a7542 05-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: tolerate bad i_size for symlink inode

An MDS bug can cause a symlink's size to be truncated to zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 1b1bc16d 03-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: improve fragtree change detection

check if number of splits in i_fragtree is equal to number of splits
in mds reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# a4b7431f 03-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: keep leaf frag when updating fragtree

Nodes in i_fragtree are sorted according to ceph_frag_compare().
This means a frag node in i_fragtree always follows its direct parent
node. To check if a leaf node is valid, we just need to check if
it's a child of the previous split node.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 42172119 03-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix dir_auth check in ceph_fill_dirfrag()

-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as
inode's auth mds

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# a407846e 03-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: don't assume frag tree splits in mds reply are sorted

The algorithm that updates i_fragtree relies on the frag tree splits
in the MDS reply being in the same order as i_fragtree. This is not
true because the current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t, but nodes in i_fragtree are sorted according to
ceph_frag_compare().

The fix is to sort the frag tree splits first, then update i_fragtree.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 209ae762 29-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix inode reference leak

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f3c4ebe6 28-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: using hash value to compose dentry offset

If the MDS sorts dentries in a dirfrag in hash order, we use the hash
value to compose the dentry offset. The dentry offset is:

(0xff << 52) | ((24 bits hash) << 28) |
(the nth entry with a hash collision)

This offset is stable across directory fragmentation. This also means
there is no need to reset the readdir offset if the directory gets
fragmented in the middle of readdir.
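
The layout above can be written out as a small user-space sketch. The
macro and function names are illustrative assumptions; only the bit layout
(0xff marker at bit 52, 24-bit hash at bit 28, collision index in the low
28 bits) comes from the description.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS	28
#define OFFSET_MASK	((1ULL << OFFSET_BITS) - 1)
#define HASH_ORDER	(0xffULL << (OFFSET_BITS + 24))	/* 0xff << 52 */

/* Compose an offset from a 24-bit hash and the index among collisions. */
uint64_t make_hash_fpos(uint32_t hash24, uint32_t nth)
{
	assert(hash24 <= 0xffffff);
	assert(nth <= OFFSET_MASK);
	return HASH_ORDER | ((uint64_t)hash24 << OFFSET_BITS) | nth;
}

/* Extract the 24-bit hash part of an offset. */
uint32_t fpos_hash(uint64_t pos)
{
	return (uint32_t)((pos >> OFFSET_BITS) & 0xffffff);
}

/* Extract the index among entries that share the same hash. */
uint32_t fpos_nth(uint64_t pos)
{
	return (uint32_t)(pos & OFFSET_MASK);
}
```

Because the offset depends only on the dentry's hash and its position
among entries with the same hash, it does not change when the directory is
split into or merged from fragments.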

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8974eebd 28-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: record 'offset' for each entry of readdir result

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 2a5beea3 27-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: define struct for dir entry in readdir reply

This avoids defining multiple arrays for entries in readdir reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# a78600e7 27-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: simplify 'offset in frag'

don't distinguish leftmost frag from other frags. always use 2 as
first entry's offset.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 6c93df5d 14-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: don't call truncate_pagecache in ceph_writepages_start

truncate_pagecache() may decrease inode's reference. This can cause
deadlock if inode's last reference is dropped and iput_final() wants
to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).

The fix is to use a work thread to truncate dirty pages. Also add a
'forced umount' check to ceph_update_writeable_page(), which prevents
new pages from getting dirty.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 2cdeb1e4 13-Apr-2016 Andreas Gruenbacher <agruenba@redhat.com>

ceph: Switch to generic xattr handlers

Add a catch-all xattr handler at the end of ceph_xattr_handlers. Check
for valid attribute names there, and remove those checks from
__ceph_{get,set,remove}xattr instead. No "system.*" xattrs need to be
handled by the catch-all handler anymore.

The set xattr handler is called with a NULL value to indicate that the
attribute should be removed; __ceph_setxattr already handles that case
correctly (ceph_set_acl could already be calling __ceph_setxattr with a NULL
value).

Move the check for snapshots from ceph_{set,remove}xattr into
__ceph_{set,remove}xattr. With that, ceph_{get,set,remove}xattr can be
replaced with the generic iops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a26fecca 13-Apr-2016 Andreas Gruenbacher <agruenba@redhat.com>

ceph: Get rid of d_find_alias in ceph_set_acl

Create a variant of ceph_setattr that takes an inode instead of a
dentry. Change __ceph_setxattr (and also __ceph_removexattr) to take an
inode instead of a dentry. Use those in ceph_set_acl so that we no
longer need a dentry there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely that it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 200fd27c 17-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: use lookup request to revalidate dentry

If a dentry has no lease, ceph_d_revalidate() previously returned 0.
This causes the VFS to invalidate the dentry and create a new dentry
for later lookup. Invalidating a dentry also detaches any mount points
underneath it. So a mount point inside cephfs can disappear mysteriously
(even if the mount point is not modified by other hosts).

The fix is to use a lookup request to revalidate a dentry without a lease.
This partly solves the disappearing mount point issue (as long as
the mount point is not modified by other hosts).

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 315f2408 06-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix security xattr deadlock

When security is enabled, the security module can call the filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by the MDS dispatch thread while
handling an MDS reply. If the MDS reply does not include xattrs and the
corresponding caps, getxattr/setxattr needs to send a new request
to the MDS and wait for the reply. This puts the MDS dispatch thread to
sleep, so nobody handles later MDS replies.

The fix is to make sure the lookup/atomic_open reply includes xattrs and
the corresponding caps, so getxattr can be handled from cached xattrs.
This requires some modification to both the MDS and the request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# a3d714c3 26-Feb-2016 Yan, Zheng <zyan@redhat.com>

ceph: avoid updating directory inode's i_size accidentally

Directory inode's i_size is used by readdir cache.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# af5e5eb5 26-Feb-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix race during filling readdir cache

Readdir cache uses page cache to save dentry pointers. When adding
dentry pointers to middle of a page, we need to make sure the page
already exists. Otherwise the beginning part of the page will be
invalid pointers.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8bbd4714 02-Feb-2016 Deepa Dinamani <deepa.kernel@gmail.com>

ceph: replace CURRENT_TIME by current_fs_time()

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f8b31710 07-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)

... and use d_add(dn, NULL) in case we need to hash a negative
unhashed rather than using d_rehash() directly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f7380af0 06-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph: don't bother with d_rehash() in splice_dentry()

d_splice_alias() guarantees that it'll be always hashed

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5ea5c5e0 14-Feb-2016 Yan, Zheng <zyan@redhat.com>

ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support

Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 99c88e69 29-Dec-2015 Yan, Zheng <zyan@redhat.com>

ceph: use i_size_{read,write} to get/set i_size

A cap message from the MDS can update i_size. In that case, we don't
hold i_mutex. So it's unsafe to directly access inode->i_size
even while holding i_mutex.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 6b255391 17-Nov-2015 Al Viro <viro@zeniv.linux.org.uk>

replace ->follow_link() with new method that could stay in RCU mode

new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
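
The calling convention listed above can be illustrated with a small
userspace sketch. It is not the kernel implementation: the struct names,
sketch_get_link() and the "fetched-target" string are hypothetical, and
the ERR_PTR helpers are re-created locally so the example builds outside
the kernel. It only shows the rule that a NULL dentry means RCU mode and
that a method which would have to block returns ERR_PTR(-ECHILD).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-creations of the kernel's error-pointer helpers. */
#define SKETCH_MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical stand-ins; inode and dentry are passed separately. */
struct sketch_inode {
	const char *cached_link;	/* NULL if the target is not cached */
};

struct sketch_dentry {
	struct sketch_inode *inode;
};

static const char *sketch_get_link(struct sketch_dentry *dentry,
				   struct sketch_inode *inode)
{
	assert(inode != NULL);

	if (inode->cached_link)
		return inode->cached_link;	/* never blocks, fine in both modes */

	if (!dentry)
		return ERR_PTR(-ECHILD);	/* RCU mode: ask to be called again */

	/* Non-RCU mode: a real filesystem could block here to fetch the target. */
	return "fetched-target";
}
```

A caller that receives -ECHILD is expected to leave RCU mode and call the
method again with a real dentry.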

It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 68cd5b4b 27-Oct-2015 Yan, Zheng <zyan@redhat.com>

ceph: make fsync() wait unsafe requests that created/modified inode

If we get an unsafe reply for a request that created/modified an inode,
add the unsafe request to a list in the newly created/modified
inode, so that fsync() can wait for these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# fdd4e158 16-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: rework dcache readdir

Previously our dcache readdir code relied on the child dentries in the
directory dentry's d_subdir list being sorted by dentry offset in
descending order. When adding dentries to the dcache, if a dentry
already exists, our readdir code moves it to the head of the directory
dentry's d_subdir list. This design relies on dcache internals.
Al Viro suggests using ncpfs's approach: keeping an array of pointers
to dentries in the page cache of the directory inode. The validity of
those pointers is represented by the directory inode's complete and
ordered flags. When a dentry gets pruned, we clear the directory inode's
complete flag in the d_prune() callback. Before moving a dentry to
another directory, we clear the ordered flag for both the old and the
new directory.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f66fd9f0 10-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: pre-allocate data structure that tracks caps flushing

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8310b089 09-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: track pending caps flushing globally

So we know TID of the oldest pending caps flushing. Later patch will
send this information to MDS, so that MDS can trim its completed caps
flush list.

Tracking pending caps flushing globally also simplifies syncfs code.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 553adfd9 09-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: track pending caps flushing accurately

Previously we did not track an accurate TID for flushing caps. When the
MDS fails over, we have no choice but to re-send all flushing caps
with a new TID. This can cause problems because the MDS may have already
flushed some caps and issued the same caps to another client.
The re-sent cap flush has a new TID, which makes the MDS unable to
detect whether it has already processed the cap flush.

This patch adds code to track pending caps flushing accurately.
When re-sending cap flush is needed, we use its original flush
TID.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 604d1b02 01-May-2015 Yan, Zheng <zyan@redhat.com>

ceph: take snap_rwsem when accessing snap realm's cached_context

When ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps()
accesses snap realm's cached_context. So we need take read lock
of snap_rwsem.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 10183a69 27-Apr-2015 Yan, Zheng <zyan@redhat.com>

ceph: check OSD caps before read/write

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# ac194dcc 02-May-2015 Al Viro <viro@zeniv.linux.org.uk>

ceph: switch to simple_follow_link()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2b0143b5 17-Mar-2015 David Howells <dhowells@redhat.com>

VFS: normal filesystems (and lustre): d_inode() annotations

that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5cba372c 01-Feb-2015 Yan, Zheng <zyan@redhat.com>

ceph: fix dentry leaks

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 2f92b3d0 18-Jan-2015 Yan, Zheng <zyan@redhat.com>

ceph: properly mark empty directory as complete

ceph_add_cap() calls __check_cap_issue(), which clears the directory
inode's complete flag. So the complete flag for an empty directory
should be set after calling ceph_add_cap().

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 38c48b5f 13-Jan-2015 Yan, Zheng <zyan@redhat.com>

ceph: provide separate {inode,file}_operations for snapdir

remove all unsupported operations from {inode,file}_operations.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# b83ae6d4 14-Jan-2015 Christoph Hellwig <hch@lst.de>

fs: remove mapping->backing_dev_info

Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 01deead0 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: use getattr request to fetch inline data

Add a new parameter 'locked_page' to ceph_do_getattr(). If the getattr
reply contains inline data, it will be copied to the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 31c542a1 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: add inline data to pagecache

Request reply and cap message can contain inline data. add inline data
to the page cache if there is Fc cap.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 70db4f36 21-Oct-2014 Yan, Zheng <zyan@redhat.com>

ceph: introduce a new inode flag indicating if cached dentries are ordered

After creating/deleting/renaming file, offsets of sibling dentries may
change. So we can not use cached dentries to satisfy readdir. But we can
still use the cached dentries to conclude -ENOENT for lookup.

This patch introduces a new inode flag indicating if child dentries are
ordered. The flag is set at the same time marking a directory complete.
After creating/deleting/renaming file, we clear the flag on directory
inode. This prevents ceph_readdir() from using cached dentries to satisfy
readdir syscall.
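
The split between the two flags can be sketched in userspace as follows.
This is only an illustration of the rule described above, not the ceph
code: the flag values, struct dir_flags and all helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for a directory inode. */
#define DIR_COMPLETE 0x1u	/* every child dentry is in the cache */
#define DIR_ORDERED  0x2u	/* cached children are still in readdir order */

struct dir_flags {
	unsigned int flags;
};

/* Both flags are set at the same time, when the directory is marked complete. */
static void dir_mark_complete(struct dir_flags *d)
{
	assert(d != NULL);
	d->flags |= DIR_COMPLETE | DIR_ORDERED;
}

/* Create/delete/rename may shift sibling offsets: only the order is lost. */
static void dir_namespace_changed(struct dir_flags *d)
{
	assert(d != NULL);
	d->flags &= ~DIR_ORDERED;
}

/* A lookup miss can be answered with -ENOENT from the cache. */
static bool dir_can_conclude_enoent(const struct dir_flags *d)
{
	return (d->flags & DIR_COMPLETE) != 0;
}

/* readdir may use cached dentries only if they are complete and ordered. */
static bool dir_can_serve_readdir(const struct dir_flags *d)
{
	unsigned int need = DIR_COMPLETE | DIR_ORDERED;

	return (d->flags & need) == need;
}
```

After dir_namespace_changed(), lookups can still rely on the cache while
readdir has to go back to the MDS.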

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# a455589f 21-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

assorted conversions to %p[dD]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 41d28bca 12-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

switch d_materialise_unique() users to d_splice_alias()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 946e51f2 26-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

move d_rcu from overlapping d_child to overlapping d_alias

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a4483e8a 17-Sep-2014 Chao Yu <chao@kernel.org>

ceph: remove redundant code for max file size verification

Both ceph_update_writeable_page and ceph_setattr will verify file size
with max size ceph supported.
There are two caller for ceph_update_writeable_page, ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
chance to change file size when mmap. Likewise we have already verified the size
in inode_change_ok when we call ceph_setattr.
So let's remove the redundant code for max file size verification.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>


# 508b32d8 16-Sep-2014 Yan, Zheng <zyan@redhat.com>

ceph: request xattrs if xattr_version is zero

The following sequence of events can happen:
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply
doesn't contain xattrs because MDS didn't receive the cap release
message and thought client already has up-to-date xattrs.

The fix is to force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 4e217b5d 07-Jun-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use truncate_pagecache() instead of truncate_inode_pages()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# f3ae1b97 06-Jun-2014 Fabian Frederick <fabf@skynet.be>

fs/ceph: replace pr_warning by pr_warn

Update the last pr_warning callsites in fs branch

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8d08503c 18-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: remember subtree root dirfrag's auth MDS

remember dirfrag's auth MDS when it's different from its parent inode's
auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 3e7fbe9c 18-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: introduce ceph_fill_fragtree()

Move the code that updates the i_fragtree into a separate function.
Also add a simple probabilistic test to decide whether the i_fragtree
should be updated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# d9df2783 17-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: pre-allocate ceph_cap struct for ceph_add_cap()

So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handle cap import/export.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# f98a128a 16-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: update inode fields according to issued caps

Cap message and request reply from non-auth MDS may carry stale
information (corresponding locks are in LOCK states) even they
have the newest inode version. So client should update inode fields
according to issued caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 0a8a70f9 13-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: clear directory's completeness when creating file

When creating a file, ceph_set_dentry_offset() puts the new dentry
at the end of the directory's d_subdirs, then sets the dentry's offset
based on the directory's max offset. The offset does not reflect the
real position of the dentry in the directory. A later readdir reply from
the MDS may change the dentry's position/offset. This inconsistency
can cause missing/duplicate entries in the readdir result if readdir
is partly satisfied by dcache_readdir().

The fix is to clear the directory's completeness after creating/renaming
a file. It prevents later readdir from using dcache_readdir().

Fixes: http://tracker.ceph.com/issues/8025
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 48193012 01-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: don't grabs open file reference for aborted request

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 5f75ce57 21-Mar-2014 Fabian Frederick <fabf@skynet.be>

ceph: Remove get/set acl on symlinks

Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>


# 8c93cd61 08-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: update i_max_size even if inode version does not change

Handle the following sequence of events:
- client releases an inode with i_max_size > 0. The release message
is queued. (is not sent to the auth MDS)
- a 'lookup' request reply from non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of inode trace
in 'open' request reply is equal to the cached inode's version.
- client requests new max size. The MDS ignores the request because
it does not affect client's write range

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 19913b4e 06-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: add get_name() NFS export callback

Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 752c8bdc 05-Feb-2013 Sage Weil <sage@inktank.com>

ceph: do not chain inode updates to parent fsync

The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>


# 72466d0b 29-Jan-2014 Sage Weil <sage@inktank.com>

ceph: fix posix ACL hooks

The merge of commit 7221fe4c2ed7 ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957d0
"fs: add generic xattr_acl handlers" and others).

Some of the fallout was fixed in commit 4db658ea0ca ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4db658ea 28-Jan-2014 Linus Torvalds <torvalds@linux-foundation.org>

ceph: Fix up after semantic merge conflict

The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2ed7 ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957d0 "fs: add generic
xattr_acl handlers", 5bf3258fd2ac "fs: make posix_acl_chmod more useful"
and 37bc15392a23 "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict. And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way. And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too. But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this. Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11df2dfb 23-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: add imported caps when handling cap export message

Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allows us to handle the situation where the importer MDS crashes
and the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 9563f88c 21-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix cache revoke race

handle following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 7221fe4c 11-Nov-2013 Guangliang Zhao <lucienchao@gmail.com>

ceph: add acl for cephfs

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>


# 9f12bd11 20-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: drop unconnected inodes

A positive dentry and its corresponding inode always come together in an
MDS reply. So there is no need to keep the inode in the cache after
dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 86b58d13 04-Dec-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: initialize inode before instantiating dentry

commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.

Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 81c6aea5 17-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: handle frag mismatch between readdir request and reply

If the client has outdated directory fragment information, it may send a
readdir request for a non-existent directory fragment. In this case, the
MDS finds an approximate directory fragment and sends its contents back
to the client. When receiving a reply with a fragment that is different
from the requested one, the client needs to reset the 'readdir offset'.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 53e879a4 17-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: remove outdated frag information

If directory fragments change, fill_inode() inserts new frags into
the fragtree, but it does not remove outdated frags from the fragtree.
This patch fixes it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# ed284c49 02-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: remove ceph_lookup_inode()

commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
which provides similar function. So remove ceph_lookup_inode(), use
ceph_find_inode() instead.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
Reviewed-by: Sage Weil <sage@inktank.com>


# 99ccbd22 21-Aug-2013 Milosz Tanski <milosz@adfin.com>

ceph: use fscache as a local persistent cache

Add support for fscache to the Ceph filesystem. This brings it on par
with some of the other network filesystems in Linux (like NFS, AFS, etc...)

In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>


# b0d7c223 12-Aug-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: introduce i_truncate_mutex

I encountered the deadlock below when running fsstress:

vmtruncate work         truncate                 MDS
---------------    ------------------    --------------------------
                   lock i_mutex
                                         <- truncate file
lock i_mutex (blocked)
                                         <- revoking Fcb (filelock to MIX)
                   send request ->
                                         handle request (xlock filelock)

Initially, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces the inode size
and creates some 'out of i_size' dirty pages. The vmtruncate work can't
truncate these dirty pages because it's blocked on the i_mutex. Later,
when the kclient receives the cap message that revokes Fcb caps, it
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.

When the MDS handles the 'truncate' request from the kclient, it waits
for the filelock to become stable. But the filelock is stuck in an
unstable state because it can't finish revoking the kclient's Fcb caps.

The truncate pagecache locking has already caused lots of trouble
for us. I think it's time to simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about a race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from executing concurrently.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 6f60f889 23-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix freeing inode vs removing session caps race

remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.

We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 85ce127a 21-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: wake up writer if vmtruncate work get blocked

To write data, the writer first acquires the i_mutex, then tries to get
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
the Fb cap in this case, the vmtruncate work can't do its job because
i_mutex is locked. We should wake up the writer and let it truncate the
pages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 84d08fa8 05-Jul-2013 Al Viro <viro@zeniv.linux.org.uk>

helper for reading ->d_count

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b415bf4f 01-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix pending vmtruncate race

The locking order for pending vmtruncate is wrong; it can lead to the
following race:

        write                    vmtruncate work
------------------------    ------------------------
lock i_mutex
check i_truncate_pending    check i_truncate_pending
truncate_inode_pages()      lock i_mutex (blocked)
copy data to page cache
unlock i_mutex
                            truncate_inode_pages()

The fix is to take i_mutex before calling __ceph_do_pending_vmtruncate().

Fixes: http://tracker.ceph.com/issues/5453
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 0b932672 07-Apr-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix symlink inode operations

add getattr/setattr and xattrs related methods.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 2f276c51 13-Mar-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use i_release_count to indicate dir's completeness

Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.

This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete
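
The comparison can be sketched in userspace C as below. This is a
minimal illustration under stated assumptions, not the ceph code: the
struct, the helper names and the initial counter values (1 and 0, chosen
so that a fresh directory starts out incomplete) are hypothetical, and
the locking used by the real ceph_inode_info is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters named in this message. */
struct dir_counts {
	long release_count;	/* bumped whenever cached dentries may be lost */
	long complete_count;	/* release_count value recorded at "complete" time */
};

/* Assumption of this sketch: start with differing values (incomplete). */
static void dir_init(struct dir_counts *d)
{
	assert(d != NULL);
	d->release_count = 1;
	d->complete_count = 0;
}

/* Any event that invalidates the cached listing bumps release_count. */
static void dir_clear_complete(struct dir_counts *d)
{
	assert(d != NULL);
	d->release_count++;
}

/* Record the release_count that was sampled before readdir started. */
static void dir_set_complete(struct dir_counts *d, long sampled_release_count)
{
	assert(d != NULL);
	d->complete_count = sampled_release_count;
}

/* The directory is complete only while the two counters still match. */
static bool dir_is_complete(const struct dir_counts *d)
{
	assert(d != NULL);
	return d->complete_count == d->release_count;
}
```

Because a stale sample can never equal the current release_count, a
readdir that raced with an invalidation cannot mark the directory
complete by mistake.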

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 3f99969f 28-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: acquire i_mutex in __ceph_do_pending_vmtruncate

Make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call it safely.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# a8673d61 18-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag

commit c6ffe10015 moved the flag that tracks whether the dcache contents
for a directory are complete to the dentry. The problem is that there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock, but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. ceph_d_prune() is
called with both the dentry to prune and the parent dentry
locked. So it's safe to access the parent dentry's d_inode and
clear the I_COMPLETE flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 79f9f99a 29-Jan-2013 Sage Weil <sage@inktank.com>

ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache. However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bd2bae6a 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

ceph: Convert kuids and kgids before printing them.

Before printing kuid and kgids values convert them into
the initial user namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# ab871b90 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

ceph: Translate inode uid and gid attributes to/from kuids and kgids.

- In fill_inode() translate uids and gids in the initial user namespace
into kuids and kgids stored in inode->i_uid and inode->i_gid.

- In ceph_setattr() if they have changed convert inode->i_uid and
inode->i_gid into initial user namespace uids and gids for
transmission.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# a85f50b6 18-Nov-2012 Yan, Zheng <zheng.z.yan@intel.com>

ceph: Fix __ceph_do_pending_vmtruncate

we should set i_truncate_pending to 0 after page cache is truncated
to i_truncate_size

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>


# 2744c171 26-Sep-2012 Al Viro <viro@zeniv.linux.org.uk>

ceph: don't abuse d_delete() on failure exits

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6c5e50fa 21-Aug-2012 Sage Weil <sage@inktank.com>

ceph: tolerate (and warn on) extraneous dentry from mds

If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>


# 810339ec 03-Feb-2012 Xi Wang <xi.wang@gmail.com>

ceph: avoid panic with mismatched symlink sizes in fill_inode()

Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.

Also use kstrndup rather than kmalloc/memcpy.
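
A userspace sketch of the same idea, assuming a hypothetical helper
copy_symlink_target(): the length check returns an error instead of
aborting, and strndup() stands in for the kernel's kstrndup(). The
function name, parameters and error choices other than -EINVAL are
illustrative only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy a symlink target of symlink_len bytes, but reject the input if
 * the length disagrees with the recorded inode size.
 * Returns 0 and sets *out on success, a negative errno otherwise.
 */
static int copy_symlink_target(const char *buf, size_t symlink_len,
			       size_t i_size, char **out)
{
	char *sym;

	assert(buf != NULL && out != NULL);

	if (symlink_len != i_size)
		return -EINVAL;	/* report the mismatch, do not panic */

	/* One call replaces an allocate-then-memcpy pair and NUL-terminates. */
	sym = strndup(buf, symlink_len);
	if (!sym)
		return -ENOMEM;

	*out = sym;
	return 0;
}
```

The caller owns the returned string and releases it with free().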

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>


# b8cd952b 13-Dec-2011 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: dereference pointer after checking for NULL

moved dereference after BUG_ON

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


# 6b520e05 12-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be655596 30-Nov-2011 Sage Weil <sage@newdream.net>

ceph: use i_ceph_lock instead of i_lock

We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction. That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.

Changing the list lock ordering would be a painful process.

However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>


# 15a2015f 05-Nov-2011 Sage Weil <sage@newdream.net>

ceph: fix iput race when queueing inode work

If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.

That is, instead of

    if (queue_work(...))
        ihold();

we need to do

    ihold();
    if (!queue_work(...))
        iput();

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
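
The ordering from 15a2015f can be modelled with a toy reference count. Everything below is a hypothetical userspace model (toy_inode, toy_queue_work and friends are not kernel APIs); the run_now flag simulates the worst case in which the work item runs before queueing returns.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_inode {
	int refcount;		/* models the inode reference count */
	bool queued;		/* models a pending work item */
	bool destroyed;		/* set once the last reference is dropped */
};

static void toy_ihold(struct toy_inode *in)
{
	in->refcount++;
}

static void toy_iput(struct toy_inode *in)
{
	assert(in->refcount > 0);
	if (--in->refcount == 0)
		in->destroyed = true;
}

/* The work function drops the reference taken on its behalf. */
static void toy_work(struct toy_inode *in)
{
	in->queued = false;
	toy_iput(in);
}

/*
 * Returns false if the work was already pending. With run_now set,
 * the work runs before this function returns.
 */
static bool toy_queue_work(struct toy_inode *in, bool run_now)
{
	if (in->queued)
		return false;
	in->queued = true;
	if (run_now)
		toy_work(in);
	return true;
}

/* Safe ordering: take the reference first, undo it if nothing was queued. */
static void toy_queue_safe(struct toy_inode *in, bool run_now)
{
	toy_ihold(in);
	if (!toy_queue_work(in, run_now))
		toy_iput(in);
}
```

With the reference taken up front, the caller's own reference survives even when the work runs immediately. In the reversed order, the work's iput would drop the caller's reference instead.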


# c6ffe100 03-Nov-2011 Sage Weil <sage@newdream.net>

ceph: use new D_COMPLETE dentry flag

We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().

Signed-off-by: Sage Weil <sage@newdream.net>


# bfe86848 28-Oct-2011 Miklos Szeredi <mszeredi@suse.cz>

filesystems: add set_nlink()

Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 83eaea22 24-Aug-2011 Sage Weil <sage@newdream.net>

Revert "ceph: don't truncate dirty pages in invalidate work thread"

This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6.

We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:

- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data

It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.

Signed-off-by: Sage Weil <sage@newdream.net>


# 4f177264 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: document locking for ceph_set_dentry_offset

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 5f21c96d 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: protect access to d_parent

d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# dfabbed6 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: set dir complete frag after adding capability

Currently ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issued cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 2f90b852 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: ignore lease mask

The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 10556cb2 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->permission()

not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2830ba7f 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to generic_permission()

redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 178ea735 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

kill check_acl callback of generic_permission()

its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 70b666c3 27-May-2011 Sage Weil <sage@newdream.net>

ceph: use ihold when we already have an inode ref

We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.

Signed-off-by: Sage Weil <sage@newdream.net>


# d3d0720d 11-May-2011 Henry C Chang <henry.cy.chang@gmail.com>

ceph: do not use i_wrbuffer_ref as refcount for Fb cap

We increment i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# fca65b4a 04-May-2011 Sage Weil <sage@newdream.net>

ceph: do not call __mark_dirty_inode under i_lock

The __mark_dirty_inode helper now takes i_lock as of 250df6ed. Fix the
one ceph caller that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that its callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>


# ad1fee96 21-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: add ino32 mount option

The ino32 mount option forces the ceph fs to report 32 bit
ino values. This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
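
One way to derive a 32-bit value from a 64-bit inode number is to fold the two halves together. The commit message for ad1fee96 does not say how the reduction is done, so the XOR fold and the avoidance of 0 below are assumptions, and fold_ino32() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical fold: XOR the high and low 32-bit halves.
 * Assumption: 0 is avoided because some userspace treats inode
 * number 0 as "no inode".
 */
static uint32_t fold_ino32(uint64_t ino)
{
	uint32_t lo = (uint32_t)(ino & 0xffffffffu);
	uint32_t hi = (uint32_t)(ino >> 32);
	uint32_t folded = lo ^ hi;

	if (!folded)
		folded = 1;
	assert(folded != 0);
	return folded;
}
```

Any such fold can map distinct 64-bit inode numbers to the same 32-bit value, which is the price of compatibility with 32-bit userspace.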


# 09adc80c 04-Feb-2011 Sage Weil <sage@newdream.net>

ceph: preserve I_COMPLETE across rename

d_move puts the renamed dentry at the end of d_subdirs, screwing with our
cached dentry directory offsets. We were just clearing I_COMPLETE to avoid
any possibility of trouble. However, assigning the renamed dentry an
offset at the end of the directory (to match its new d_subdirs position)
is sufficient to maintain correct behavior and hold onto I_COMPLETE.

This is especially important for workloads like rsync, which renames files
into place. Before, we would lose I_COMPLETE and do MDS lookups for each
file. With this patch we only talk to the MDS on create and rename.

Signed-off-by: Sage Weil <sage@newdream.net>


# b545cc15 28-Feb-2011 Sage Weil <sage@newdream.net>

ceph: do not set I_COMPLETE

Do not set the I_COMPLETE flag on directories until we resolve races with
dcache pruning.

Signed-off-by: Sage Weil <sage@newdream.net>


# 1c1266bb 12-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: fix getattr on directory when using norbytes

The norbytes mount option was broken: when doing getattr
on a directory it returned the rbytes instead of the number of
entities. This commit fixes it.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
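
The intended behaviour in 1c1266bb can be sketched as a size selection. The struct and function names are hypothetical, and treating "number of entities" as files plus subdirectories is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-directory statistics. */
struct toy_dir_stats {
	uint64_t rbytes;	/* recursive byte count */
	uint64_t files;		/* direct file entries */
	uint64_t subdirs;	/* direct subdirectory entries */
};

/* Size to report from getattr for a directory. */
static uint64_t toy_dir_size(const struct toy_dir_stats *st, bool rbytes_opt)
{
	assert(st);
	if (rbytes_opt)
		return st->rbytes;		/* rbytes: recursive bytes */
	return st->files + st->subdirs;		/* norbytes: entry count */
}
```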


# 14303d20 14-Dec-2010 Sage Weil <sage@newdream.net>

ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS

This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS. This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests. This will result in a forward
and degrade performance. It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>


# 6c0f3af7 16-Nov-2010 Sage Weil <sage@newdream.net>

ceph: add dir_layout to inode

Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely weak and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>


# b74c79e9 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# fa0d7e3d 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b5c84bf6 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache remove dcache_lock

dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 2fd6b7f5 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache scale subdirs

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b7ab39f6 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 451a3c24 17-Nov-2010 Arnd Bergmann <arnd@arndb.de>

BKL: remove extraneous #include <smp_lock.h>

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b7495fc2 09-Nov-2010 Sage Weil <sage@newdream.net>

ceph: make page alignment explicit in osd interface

We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.

Signed-off-by: Sage Weil <sage@newdream.net>


# d8672d64 08-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix update of ctime from MDS

The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).

Signed-off-by: Sage Weil <sage@newdream.net>


# 8bd59e01 08-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix version check on racing inode updates

We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have. The
exception is when an MDS sends unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied. Fixed and clarified the comment.

Signed-off-by: Sage Weil <sage@newdream.net>
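
A sketch of the comparison in 8bd59e01, under assumptions the commit message does not spell out: versions are odd for unstable info and even for stable info, and our version is rounded down to even before comparing with >=. The function names are hypothetical and the kernel's check may include further conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Skip the reply if our version, rounded down to even, is at least
 * the reply's version.
 *
 *   ours  theirs
 *   2     2       skip
 *   3     2       skip    (the case a plain > check got wrong)
 *   3     3       update  (unstable info at our own version)
 */
static bool toy_skip_update(uint64_t ours, uint64_t theirs)
{
	return (ours & ~UINT64_C(1)) >= theirs;
}

/* Apply a reply's version unless it should be skipped. */
static bool toy_apply_reply(uint64_t *ours, uint64_t theirs)
{
	assert(ours);
	if (toy_skip_update(*ours, theirs))
		return false;
	*ours = theirs;
	return true;
}
```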


# cd045cb4 04-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix rdcache_gen usage and invalidate

We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we succeed
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>
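
The counter handling in cd045cb4 can be modelled as below. This is a toy under assumptions: queueing an invalidate is assumed to record the current generation in revoking, and a new issue of FILE_CACHE is assumed to bump gen. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_ci {
	unsigned int gen;	/* models rdcache_gen */
	unsigned int revoking;	/* models rdcache_revoking */
};

/* Assumption: queueing an invalidate remembers the current generation. */
static void toy_queue_invalidate(struct toy_ci *ci)
{
	ci->revoking = ci->gen;
}

/* Assumption: a new issue of FILE_CACHE bumps the generation. */
static void toy_issue_file_cache(struct toy_ci *ci)
{
	ci->gen++;
}

/* Async work: invalidate only if no new issue raced with the request. */
static bool toy_async_invalidate(struct toy_ci *ci)
{
	assert(ci);
	if (ci->revoking != ci->gen)
		return false;	/* cap was re-issued; keep the pages */
	/* ... the page cache would be truncated here ... */
	ci->revoking--;		/* success: no longer equal to gen */
	return true;
}

/* After a successful synchronous invalidate, make the async one a no-op. */
static void toy_sync_invalidate_done(struct toy_ci *ci)
{
	ci->revoking = ci->gen - 1;
}
```

Because success leaves revoking unequal to gen, a second run of the async work does nothing until another invalidate is queued.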


# 912a9b03 07-Nov-2010 Sage Weil <sage@newdream.net>

ceph: only let auth caps update max_size

Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.

Signed-off-by: Sage Weil <sage@newdream.net>
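
The rule from 912a9b03 reduces to one guard. A minimal sketch with hypothetical names, assuming the caller already knows whether the reply issues an auth cap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept a max_size value only from the auth MDS. A value carried by
 * a reply from a non-auth MDS is ignored so it cannot clobber a
 * meaningful one.
 */
static void toy_update_max_size(uint64_t *max_size, uint64_t reply_max_size,
				bool is_auth_cap)
{
	assert(max_size);
	if (is_auth_cap)
		*max_size = reply_max_size;
}
```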


# d8b16b3d 06-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix bad pointer dereference in ceph_fill_trace

We dereference *in a few lines down, but only set it on rename. It is
apparently pretty rare for this to trigger, but I have been hitting it
with clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>


# 3d14c5d2 06-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>


# 467c5251 13-Sep-2010 Sage Weil <sage@newdream.net>

ceph: fix dn offset during readdir_prepopulate

When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbering our just-set offset. This can cause the readdir result offsets
to get out of sync with the server. Add an argument to the helper so
that it does not.

This bug was introduced by 1cd3935bedccf592d44343890251452a6dd74fc4.

Signed-off-by: Sage Weil <sage@newdream.net>


# ac1f12ef 25-Aug-2010 Dan Carpenter <error27@gmail.com>

ceph: ceph_get_inode() returns an ERR_PTR

ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 12451491 22-Aug-2010 Sage Weil <sage@newdream.net>

ceph: don't improperly set dir complete when holding EXCL cap

If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty. If we do, we can get bad results from lookup (bad ENOENT) and
bad readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>


# 2962507c 27-May-2010 Sage Weil <sage@newdream.net>

ceph: perform lazy reads when file mode and caps permit

If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it. Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.

Signed-off-by: Sage Weil <sage@newdream.net>


# 03066f23 27-Jul-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: use complete_all and wake_up_all

This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 8c696737 22-Jul-2010 Sage Weil <sage@newdream.net>

ceph: fix leak of dentry in ceph_init_dentry() error path

If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>


# d69ed05a 21-Jun-2010 Sage Weil <sage@newdream.net>

ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate

Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing. (Also, report the error code.)

Signed-off-by: Sage Weil <sage@newdream.net>


# 13a4214c 01-Jun-2010 Henry C Chang <henry_c_chang@tcloudcomputing.com>

ceph: fix d_subdirs ordering problem

We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.

This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur at the first time you run autogen.sh.)

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 7e34bc52 21-May-2010 Julia Lawall <julia@diku.dk>

fs/ceph: Use ERR_CAST

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
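
The distinction drawn in 7e34bc52 can be illustrated with a userspace imitation of the error-pointer helpers. The toy_* functions below are simplified stand-ins written for this sketch, not the kernel's definitions, and the inode/dentry types are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top TOY_MAX_ERRNO addresses. */
static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Decode the errno value from an error pointer. */
static inline long toy_ptr_err(const void *p)
{
	assert(toy_is_err(p));
	return (long)(intptr_t)p;
}

/* Re-type an error pointer without decoding and re-encoding it. */
static inline void *toy_err_cast(const void *p)
{
	return (void *)(uintptr_t)p;
}

struct toy_inode { int ino; };
struct toy_dentry { struct toy_inode *inode; };

/* Returns an error pointer, never NULL, on failure. */
static struct toy_inode *toy_get_inode(int ino)
{
	static struct toy_inode in;

	if (ino <= 0)
		return toy_err_ptr(-EINVAL);
	in.ino = ino;
	return &in;
}

/* Return type differs from toy_get_inode(), so the error is cast. */
static struct toy_dentry *toy_lookup(int ino)
{
	static struct toy_dentry dn;
	struct toy_inode *in = toy_get_inode(ino);

	if (toy_is_err(in))
		return toy_err_cast(in);
	dn.inode = in;
	return &dn;
}
```

When the enclosing function returns the same pointer type as the failed call, the error pointer can simply be returned as-is, which is the case the commit message notes for fs/ceph/inode.c.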


# 167c9e35 14-May-2010 Sage Weil <sage@newdream.net>

ceph: use common helper for aborted dir request invalidation

We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay. Use common helper to avoid duplicate code.

Signed-off-by: Sage Weil <sage@newdream.net>


# 1cd3935b 03-May-2010 Sage Weil <sage@newdream.net>

ceph: set dn offset when spliced

We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry(). Notably, we should NOT assign an
offset when a dentry is first created and is still null.

BUG if we try to splice a non-null dentry (we shouldn't).

Signed-off-by: Sage Weil <sage@newdream.net>


# 1b7facc4 16-Apr-2010 Sage Weil <sage@newdream.net>

ceph: don't clobber i_max_offset on already complete dir

This can screw up offsets assigned to new dentries and break dcache
readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>


# e8a74987 15-Apr-2010 Sage Weil <sage@newdream.net>

ceph: skip set_dentry_offset work if directory not I_COMPLETE

Signed-off-by: Sage Weil <sage@newdream.net>


# a6424e48 29-Apr-2010 Sage Weil <sage@newdream.net>

ceph: fix xattr dangling pointer / double free

If we use the xattr_blob, clear the pointer so we don't release the memory
at the bottom of the function.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
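
The ownership hand-off behind a6424e48 can be sketched as follows. The names are hypothetical and the free counter exists only so the behaviour can be observed; the point is that the local pointer is cleared once the inode owns the blob, so the common cleanup at the bottom does not release it a second time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_blob { char data[16]; };
struct toy_inode { struct toy_blob *xattrs; };

static int toy_frees;	/* counts releases, for illustration only */

static void toy_blob_free(struct toy_blob *b)
{
	toy_frees++;
	free(b);
}

static int toy_fill(struct toy_inode *in, bool take_xattrs)
{
	struct toy_blob *blob;

	assert(in);
	blob = calloc(1, sizeof(*blob));
	if (!blob)
		return -1;

	if (take_xattrs) {
		if (in->xattrs)
			toy_blob_free(in->xattrs);
		in->xattrs = blob;
		blob = NULL;	/* ownership moved to the inode */
	}

	/* common exit path: release only what we still own */
	if (blob)
		toy_blob_free(blob);
	return 0;
}
```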


# 640ef79d 26-Mar-2010 Cheng Renquan <crquan@gmail.com>

ceph: use ceph_sb_to_client instead of ceph_client

ceph_sb_to_client and ceph_client are identical, so we should drop
one. The function name ceph_client is easily confused with "struct
ceph_client", and ceph_sb_to_client says more clearly what it does, so
switch all callers to ceph_sb_to_client.

-static inline struct ceph_client *ceph_client(struct super_block *sb)
-{
- return sb->s_fs_info;
-}

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 81a6cf2d 14-May-2010 Sage Weil <sage@newdream.net>

ceph: invalidate affected dentry leases on aborted requests

If we abort a request, we return to caller, but the request may still
complete. And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request. A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').

Ensure we invalidate affected dentry leases (as well as dir I_COMPLETE) so
we don't have valid (but incorrect) leases. Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.

Signed-off-by: Sage Weil <sage@newdream.net>


# 04d000eb 07-May-2010 Sage Weil <sage@newdream.net>

ceph: fix open file counting on snapped inodes when mds returns no caps

It's possible the MDS will not issue caps on a snapped inode, in which case
an open request may not __ceph_get_fmode(), botching the open file
counting. (This is actually a server bug, but the client shouldn't BUG out
in this case.)

Signed-off-by: Sage Weil <sage@newdream.net>


# c10f5e12 16-Apr-2010 Sage Weil <sage@newdream.net>

ceph: clear dir complete on d_move

d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.

Signed-off-by: Sage Weil <sage@newdream.net>


# 9358c6d4 30-Mar-2010 Sage Weil <sage@newdream.net>

ceph: fix dentry rehashing on virtual .snap dir

If a lookup fails on the magic .snap directory, we bind it to a magic
snap directory inode in ceph_lookup_finish(). That code assumes the dentry
is unhashed, but a recent server-side change started returning NULL leases
on lookup failure, causing the .snap dentry to be hashed and NULL by
ceph_fill_trace().

This causes dentry hash chain corruption, or dies when d_rehash()
includes
BUG_ON(!d_unhashed(entry));

So, avoid processing the NULL dentry lease if the dentry matches the
snapdir name in ceph_fill_trace(). That allows the lookup completion to
properly bind it to the snapdir inode. BUG there if dentry is hashed to
be sure.

Signed-off-by: Sage Weil <sage@newdream.net>


# 8b218b8a 09-Mar-2010 Sage Weil <sage@newdream.net>

ceph: fix inode removal from snap realm when racing with migration

When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issued caps were non-zero and
__ceph_is_any_caps(ci) was true. This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.

Fix this by dropping any residual i_snap_realm ref in destroy_inode.

Signed-off-by: Sage Weil <sage@newdream.net>


# c9af9fb6 18-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: don't truncate dirty pages in invalidate work thread

Instead of truncating the whole range of pages, we skip those
pages that are dirty or in the middle of writeback. Those pages
will be cleared later when the writeback completes.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 2c27c9a5 17-Feb-2010 Sage Weil <sage@newdream.net>

ceph: fix typo in ceph_queue_writeback debug output

Signed-off-by: Sage Weil <sage@newdream.net>


# 3c6f6b79 09-Feb-2010 Sage Weil <sage@newdream.net>

ceph: cleanup async writeback, truncation, invalidate helpers

Grab inode ref in helper. Make work functions static, with consistent
naming.

Signed-off-by: Sage Weil <sage@newdream.net>


# 3d497d85 09-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: fix truncation when not holding caps

A truncation should occur when either we have the
specified caps for the file, or (in cases where we are
not the only ones referencing the file) when it is mapped
or when it is opened. The latter two cases were not
handled.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 0f26c4b2 29-Jan-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: remove unreachable code

We never truncate to a smaller size without contacting the MDS.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 5b1daecd 25-Jan-2010 Sage Weil <sage@newdream.net>

ceph: properly handle aborted mds requests

Previously, if the MDS request was interrupted, we would unregister the
request and ignore any reply. This could cause the caps or other cache
state to become out of sync. (For instance, aborting dbench and doing
rm -r on clients would complain about a non-empty directory because the
client didn't realize its aborted file create request completed.)

Even if we don't unregister, we still can't process the reply normally because
we are no longer holding the caller's locks (like the dir i_mutex).

So, mark aborted operations with r_aborted, and in the reply handler, be
sure to process all the caps. Do not process the namespace changes,
though, since we no longer will hold the dir i_mutex. The dentry lease
state can also be ignored as it's more forgiving.

Signed-off-by: Sage Weil <sage@newdream.net>


# 4baa75ef 07-Jan-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: change dentry offset and position after splice_dentry

This fixes a bug, where we had the parent list have dentries with
offsets that are not monotonically increasing, which caused the ceph
dcache_readdir to skip entries.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# c4a29f26 21-Dec-2009 Sage Weil <sage@newdream.net>

ceph: ensure rename target dentry fails revalidation

This works around a bug in vfs_rename_dir() that rehashes the target
dentry. Ensure such dentries always fail revalidation by timing out the
dentry lease and kicking it out of the current directory lease gen.

This can be reverted when the vfs bug is fixed.

Signed-off-by: Sage Weil <sage@newdream.net>


# b6c1d5b8 07-Dec-2009 Sage Weil <sage@newdream.net>

ceph: simplify ceph_buffer interface

We never allocate the ceph_buffer and buffer separately, so use a single
constructor.

Disallow put on NULL buffer; make the caller check.

Signed-off-by: Sage Weil <sage@newdream.net>


# b377ff13 11-Nov-2009 Sage Weil <sage@newdream.net>

ceph: initialize i_size/i_rbytes on snapdir

Signed-off-by: Sage Weil <sage@newdream.net>


# 232d4b01 21-Oct-2009 Sage Weil <sage@newdream.net>

ceph: move directory size logic to ceph_getattr

We can't fill i_size with rbytes at the fill_file_size stage without
adding additional checks for directories. Notably, we want st_blocks
to remain 0 on directories so that 'du' still works.

Fill in i_blocks, i_size specially in ceph_getattr instead.

Signed-off-by: Sage Weil <sage@newdream.net>


# 355da1eb 06-Oct-2009 Sage Weil <sage@newdream.net>

ceph: inode operations

Inode cache and inode operations. We also include routines to
incorporate metadata structures returned by the MDS into the client
cache, and some helpers to deal with file capabilities and metadata
leases. The bulk of that work is done by fill_inode() and
fill_trace().

Signed-off-by: Sage Weil <sage@newdream.net>