History log of /linux-master/fs/ceph/super.h
Revision Date Author Comments
# 07045648 03-Jan-2024 Xiubo Li <xiubli@redhat.com>

ceph: always check dir caps asynchronously

The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:

"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."

So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aaefabc4 06-Nov-2023 Xiubo Li <xiubli@redhat.com>

ceph: try to allocate a smaller extent map for sparse read

In fscrypt case and for a smaller read length we can predict the
max count of the extent map. And for small read length use cases
this could save some memories.

[ idryomov: squash into a single patch to avoid build break, drop
redundant variable in ceph_alloc_sparse_ext_map() ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 79c66a0c 07-Aug-2023 Alexander Mikhalitsyn <alexander@mihalicyn.com>

ceph: pass idmap to __ceph_setattr

Just pass down the mount's idmapping to __ceph_setattr,
because we will need it later.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 38d46409 11-Jun-2023 Xiubo Li <xiubli@redhat.com>

ceph: print cluster fsid and client global_id in all debug logs

Multiple CephFS mounts on a host is increasingly common so
disambiguating messages like this is necessary and will make it easier
to debug issues.

At the same this will improve the debug logs to make them easier to
troubleshooting issues, such as print the ino# instead only printing
the memory addresses of the corresponding inodes and print the dentry
names instead of the corresponding memory addresses for the dentry,etc.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5995d90d 11-Jun-2023 Xiubo Li <xiubli@redhat.com>

ceph: rename _to_client() to _to_fs_client()

We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 197b7d79 09-Jun-2023 Xiubo Li <xiubli@redhat.com>

ceph: pass the mdsc to several helpers

We will use the 'mdsc' to get the global_id in the following commits.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 10f9fbe9 29-Sep-2023 Wedson Almeida Filho <walmeida@microsoft.com>

ceph: move ceph_xattr_handlers to .rodata

This makes it harder for accidental or malicious changes to
ceph_xattr_handlers at runtime.

Cc: Xiubo Li <xiubli@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-7-wedsonaf@gmail.com
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 1464de9f 09-May-2023 Xiubo Li <xiubli@redhat.com>

ceph: wait for OSD requests' callbacks to finish when unmounting

The sync_filesystem() will flush all the dirty buffer and submit the
osd reqs to the osdc and then is blocked to wait for all the reqs to
finish. But the when the reqs' replies come, the reqs will be removed
from osdc just before the req->r_callback()s are called. Which means
the sync_filesystem() will be woke up by leaving the req->r_callback()s
are still running.

This will be buggy when the waiter require the req->r_callback()s to
release some resources before continuing. So we need to make sure the
req->r_callback()s are called before removing the reqs from the osdc.

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

We need to increase the blocker counter to make sure all the osd
requests' callbacks have been finished just before calling the
kill_anon_super() when unmounting.

Link: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e3dfcab2 20-Dec-2022 Xiubo Li <xiubli@redhat.com>

ceph: drop messages from MDS when unmounting

When unmounting all the dirty buffers will be flushed and after
the last osd request is finished the last reference of the i_count
will be released. Then it will flush the dirty cap/snap to MDSs,
and the unmounting won't wait the possible acks, which will ihold
the inodes when updating the metadata locally but makes no sense
any more, of this. This will make the evict_inodes() to skip these
inodes.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

Later the kernel will crash when iput() the inodes and dereferencing
the "sb->s_master_keys", which has been released by the
generic_shutdown_super().

Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5c64737d 25-Aug-2022 Xiubo Li <xiubli@redhat.com>

ceph: add truncate size handling support for fscrypt

This will transfer the encrypted last block contents to the MDS
along with the truncate request only when the new size is smaller
and not aligned to the fscrypt BLOCK size. When the last block is
located in the file hole, the truncate request will only contain
the header.

The MDS could fail to do the truncate if there has another client
or process has already updated the RADOS object which contains
the last block, and will return -EAGAIN, then the kclient needs
to retry it. The RMW will take around 50ms, and will let it retry
20 times for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d4d51887 25-Aug-2022 Xiubo Li <xiubli@redhat.com>

ceph: add object version support for sync read

Turn the guts of ceph_sync_read into a new helper that takes an inode
and an offset instead of a kiocb struct, and make ceph_sync_read call
the helper as a wrapper.

Make the new helper always return the last object's version.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6b5717bd 08-Sep-2020 Jeff Layton <jlayton@kernel.org>

ceph: implement -o test_dummy_encryption mount option

Add support for the test_dummy_encryption mount option. This allows us
to test the encrypted codepaths in ceph without having to manually set
keys, etc.

[ lhenriques: fix potential fsc->fsc_dummy_enc_policy memory leak in
ceph_real_mount() ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2d332d5b 27-Jul-2020 Jeff Layton <jlayton@kernel.org>

ceph: fscrypt_auth handling for ceph

Most fscrypt-enabled filesystems store the crypto context in an xattr,
but that's problematic for ceph as xatts are governed by the XATTR cap,
but we really want the crypto context as part of the AUTH cap.

Because of this, the MDS has added two new inode metadata fields:
fscrypt_auth and fscrypt_file. The former is used to hold the crypto
context, and the latter is used to track the real file size.

Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now,
we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt
context.

Allow the client to use a setattr request for setting the fscrypt_auth
field. Since this is not a standard setattr request from the VFS, we add
a new field to __ceph_setattr that carries ceph-specific inode attrs.

Have the set_context op do a setattr that sets the fscrypt_auth value,
and get_context just return the contents of that field (since it should
always be available).

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ec9595c0 26-Aug-2020 Jeff Layton <jlayton@kernel.org>

ceph: preallocate inode for ops that may create one

When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.

Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.

In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).

The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 03bc06c7 26-Feb-2022 Jeff Layton <jlayton@kernel.org>

ceph: add new mount option to enable sparse reads

Add a new mount option that has the client issue sparse reads instead of
normal ones. The callers now preallocate an sparse extent buffer that
the libceph receive code can populate and hand back after the operation
completes.

After a successful sparse read, we can't use the req->r_result value to
determine the amount of data "read", so instead we set the received
length to be from the end of the last extent in the buffer. Any
interstitial holes will have been filled by the receive code.

[ xiubli: fix a double free on req reported by Ilya ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 23ee27dc 10-May-2023 Xiubo Li <xiubli@redhat.com>

ceph: add a dedicated private data for netfs rreq

We need to save the 'f_ra.ra_pages' to expand the readahead window
later.

Cc: stable@vger.kernel.org
Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
Link: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
Link: https://www.spinics.net/lists/ceph-users/msg76183.html
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aaf67de7 18-Apr-2023 Xiubo Li <xiubli@redhat.com>

ceph: fix potential use-after-free bug when trimming caps

When trimming the caps and just after the 'session->s_cap_lock' is
released in ceph_iterate_session_caps() the cap maybe removed by
another thread, and when using the stale cap memory in the callbacks
it will trigger use-after-free crash.

We need to check the existence of the cap just after the 'ci->i_ceph_lock'
being acquired. And do nothing if it's already removed.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/43272
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 4609e1f1 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->permission() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 13e83a49 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->set_acl() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# b74d24f7 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->getattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1632a0f 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->setattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# a68e564a 31-Jan-2023 Xiubo Li <xiubli@redhat.com>

ceph: blocklist the kclient when receiving corrupted snap trace

When received corrupted snap trace we don't know what exactly has
happened in MDS side. And we shouldn't continue IOs and metadatas
access to MDS, which may corrupt or get incorrect contents.

This patch will just block all the further IO/MDS requests
immediately and then evict the kclient itself.

The reason why we still need to evict the kclient just after
blocking all the further IOs is that the MDS could revoke the caps
faster.

Link: https://tracker.ceph.com/issues/57686
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b38b17b6 31-Jan-2023 Xiubo Li <xiubli@redhat.com>

ceph: move mount state enum to super.h

These flags are only used in ceph filesystem in fs/ceph, so just
move it to the place it should be.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 461ab10e 16-Nov-2022 Xiubo Li <xiubli@redhat.com>

ceph: switch to vfs_inode_has_locks() to fix file lock bug

For the POSIX locks they are using the same owner, which is the
thread id. And multiple POSIX locks could be merged into single one,
so when checking whether the 'file' has locks may fail.

For a file where some openers use locking and others don't is a
really odd usage pattern though. Locks are like stoplights -- they
only work if everyone pays attention to them.

Just switch ceph_get_caps() to check whether any locks are set on
the inode. If there are POSIX/OFD/FLOCK locks on the file at the
time, we should set CHECK_FILELOCK, regardless of what fd was used
to set the lock.

Fixes: ff5d913dfc71 ("ceph: return -EIO if read/write against filp that lost file locks")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 68c62bee 17-Oct-2022 Xiubo Li <xiubli@redhat.com>

ceph: try to check caps immediately after async creating finishes

We should call the check_caps() again immediately after the async
creating finishes in case the MDS is waiting for caps revocation
to finish.

Link: https://tracker.ceph.com/issues/46904
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e4b731cc 17-Oct-2022 Xiubo Li <xiubli@redhat.com>

ceph: remove useless session parameter for check_caps()

The session parameter makes no sense any more.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 138060ba 23-Sep-2022 Christian Brauner <brauner@kernel.org>

fs: pass dentry to set acl method

The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystem rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
operation. But since ->set_acl() is required in order to use the generic
posix acl xattr handlers filesystems that do not implement this inode
operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 48490776 06-Jun-2022 Xiubo Li <xiubli@redhat.com>

ceph: don't get the inline data for new creating files

If the 'i_inline_version' is 1, that means the file is just new
created and there shouldn't have any inline data in it, we should
skip retrieving the inline data from MDS.

This also could help reduce possiblity of dead lock issue introduce
by the inline data and Fcr caps.

Gradually we will remove the inline feature from kclient after ceph's
scrub too have support to unline the inline data, currently this
could help reduce the teuthology test failures.

This is possiblly could also fix a bug that for some old clients if
they couldn't explictly uninline the inline data when writing, the
inline version will keep as 1 always. We may always reading non-exist
data from inline data.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 00061645 09-Jun-2022 Xiubo Li <xiubli@redhat.com>

ceph: update the auth cap when the async create req is forwarded

For async create we will always try to choose the auth MDS of frag
the dentry belonged to of the parent directory to send the request
and ususally this works fine, but if the MDS migrated the directory
to another MDS before it could be handled the request will be
forwarded. And then the auth cap will be changed.

We need to update the auth cap in this case before the request is
forwarded.

Link: https://tracker.ceph.com/issues/55857
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e19feff9 09-Jun-2022 Xiubo Li <xiubli@redhat.com>

ceph: make change_auth_cap_ses a global symbol

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 4868e537 09-May-2022 Xiubo Li <xiubli@redhat.com>

ceph: wait for the first reply of inflight async unlink

In async unlink case the kclient won't wait for the first reply
from MDS and just drop all the links and unhash the dentry and then
succeeds immediately.

For any new create/link/rename,etc requests followed by using the
same file names we must wait for the first reply of the inflight
unlink request, or the MDS possibly will fail these following
requests with -EEXIST if the inflight async unlink request was
delayed for some reasons.

And the worst case is that for the none async openc request it will
successfully open the file if the CDentry hasn't been unlinked yet,
but later the previous delayed async unlink request will remove the
CDenty. That means the just created file is possiblly deleted later
by accident.

We need to wait for the inflight async unlink requests to finish
when creating new files/directories by using the same file names.

Link: https://tracker.ceph.com/issues/55332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 874c8ca1 09-Jun-2022 David Howells <dhowells@redhat.com>

netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context

While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:

In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.

[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 55ab5520 27-Apr-2022 Luís Henriques <lhenriques@suse.de>

ceph: fix statfs for subdir mounts

When doing a mount using as base a directory that has 'max_bytes' quotas
statfs uses that value as the total; if a subdirectory is used instead,
the same 'max_bytes' too in statfs, unless there is another quota set.

Unfortunately, if this subdirectory only has the 'max_files' quota set,
then statfs uses the filesystem total. Fix this by making sure we only
lookup realms that contain the 'max_bytes' quota.

Cc: Ryan Taylor <rptaylor@uvic.ca>
URL: https://tracker.ceph.com/issues/55090
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5eed80fb 20-Apr-2022 Xiubo Li <xiubli@redhat.com>

ceph: try to choose the auth MDS if possible for getattr

If any 'x' caps is issued we can just choose the auth MDS instead
of the random replica MDSes. Because only when the Locker is in
LOCK_EXEC state will the loner client could get the 'x' caps. And
if we send the getattr requests to any replica MDS it must auth pin
and tries to rdlock from the auth MDS, and then the auth MDS need
to do the Locker state transition to LOCK_SYNC. And after that the
lock state will change back.

This cost much when doing the Locker state transition and usually
will need to revoke caps from clients.

URL: https://tracker.ceph.com/issues/55240
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bc899ee1 29-Jun-2021 David Howells <dhowells@redhat.com>

netfs: Add a netfs inode context

Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

struct my_inode {
struct {
/* These must be contiguous */
struct inode vfs_inode;
struct netfs_i_context netfs_ctx;
};
};

The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:

struct netfs_i_context {
...
struct fscache_cookie *cache;
};

Three functions are provided to help with this:

(1) void netfs_i_context_init(struct inode *inode,
const struct netfs_request_ops *ops);

Initialise the netfs context and set the operations.

(2) struct netfs_i_context *netfs_i_context(struct inode *inode);

Find the netfs context from the VFS inode.

(3) struct inode *netfs_inode(struct netfs_i_context *ctx);

Find the VFS inode from the netfs context.

Changes
=======
ver #4)
- Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
cache is present[3].
- Fix netfs_skip_folio_read() to zero out all of the page, not just some
of it[3].

ver #3)
- Split out the bit to move ceph cap-getting on readahead into
ceph_init_request()[1].
- Stick in a comment to the netfs inode structs indicating the contiguity
requirements[2].

ver #2)
- Adjust documentation to match.
- Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
- Move the cap check from ceph_readahead() to ceph_init_request() to be
called from netfslib.
- Remove ceph_readahead() and use netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com

Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4


# 5ed91587 22-Feb-2022 Xiubo Li <xiubli@redhat.com>

ceph: do not release the global snaprealm until unmounting

The global snaprealm would be created and then destroyed immediately
every time when updating it.

URL: https://tracker.ceph.com/issues/54362
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 74a31df4 18-Feb-2022 Xiubo Li <xiubli@redhat.com>

ceph: eliminate the recursion when rebuilding the snap context

Use a list instead of recursion to avoid possible stack overflow.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ab58a5a1 15-Feb-2022 Xiubo Li <xiubli@redhat.com>

ceph: move to a dedicated slabcache for ceph_cap_snap

There could be huge number of capsnaps around at any given time. On
x86_64 the structure is 248 bytes, which will be rounded up to 256 bytes
by kzalloc. Move this to a dedicated slabcache to save 8 bytes for each.

[ jlayton: use kmem_cache_zalloc ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6ddf5f16 13-Feb-2022 Milind Changire <milindchangire@gmail.com>

ceph: add getvxattr op

Problem:
Some directory vxattrs (e.g. ceph.dir.pin.random) are governed by
information that isn't necessarily shared with the client. Add support
for the new GETVXATTR operation, which allows the client to query the
MDS directly for vxattrs.
When the client is queried for a vxattr that doesn't have a special
handler, have it issue a GETVXATTR to the MDS directly.

Solution:
Adds new getvxattr op to fetch ceph.dir.pin*, ceph.dir.layout* and
ceph.file.layout* vxattrs.
If the entire layout for a dir or a file is being set, then it is
expected that the layout be set in standard JSON format. Individual
field value retrieval is not wrapped in JSON. The JSON format also
applies while setting the vxattr if the entire layout is being set in
one go.
As a temporary measure, setting a vxattr can also be done in the old
format. The old format will be deprecated in the future.

URL: https://tracker.ceph.com/issues/51062
Signed-off-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 083db6fd 15-Dec-2021 David Howells <dhowells@redhat.com>

ceph: uninline the data on a file opened for writing

If a ceph file is made up of inline data, uninline that in the ceph_open()
rather than in ceph_page_mkwrite(), ceph_write_iter(), ceph_fallocate() or
ceph_write_begin().

This makes it easier to convert to using the netfs library for VM write
hooks.

Should this also take the inode lock for the duration on uninlining to
prevent a race with truncation?

[ jlayton: fix up folio locking, update i_inline_version after write ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 503d4fa6 22-Mar-2022 NeilBrown <neilb@suse.de>

ceph: remove reliance on bdi congestion

The bdi congestion tracking in not widely used and will be removed.

CEPHfs is one of a small number of filesystems that uses it, setting just
the async (write) congestion flags at what it determines are appropriate
times.

The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flag, set an internal flag and change:

- .writepages to do nothing if WB_SYNC_NONE and the flag is set

- .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
flag is set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) wil further delay the next attempt
at writeout. This might be a good thing.

Link: https://lkml.kernel.org/r/164549983739.9187.14895675781408171186.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a0b3a15e 10-Jan-2022 Jeff Layton <jlayton@kernel.org>

ceph: move CEPH_SUPER_MAGIC definition to magic.h

The uapi headers are missing the ceph definition. Move it there so
userland apps can ID cephfs.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 94cc0877 30-Nov-2021 Jeff Layton <jlayton@kernel.org>

ceph: add new "nopagecache" option

CephFS is a bit unlike most other filesystems in that it only
conditionally does buffered I/O based on the caps that it gets from the
MDS. In most cases, unless there is contended access for an inode the
MDS does give Fbc caps to the client, so the unbuffered codepaths are
only infrequently traveled and are difficult to test.

At one time, the "-o sync" mount option would give you this behavior,
but that was removed in commit 7ab9b3807097 ("ceph: Don't use
ceph-sync-mode for synchronous-fs.").

Add a new mount option to tell the client to ignore Fbc caps when doing
I/O, and to use the synchronous codepaths exclusively, even on
non-O_DIRECT file descriptors. We already have an ioctl that forces this
behavior on a per-file basis, so we can just always set the CEPH_F_SYNC
flag in the file description on such mounts.

Additionally, this patch also changes the client to not request Fbc when
doing direct I/O. We aren't using the cache with O_DIRECT so we don't
have any need for those caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0078ea3b 09-Nov-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't check for quotas on MDS stray dirs

玮文 胡 reported seeing the WARN_RATELIMIT pop when writing to an
inode that had been transplanted into the stray dir. The client was
trying to look up the quotarealm info from the parent and that tripped
the warning.

Change the ceph_vino_is_reserved helper to not throw a warning for
MDS stray directories (0x100 - 0x1ff), only for reserved dirs that
are not in that range.

Also, fix ceph_has_realms_with_quotas to return false when encountering
a reserved inode.

URL: https://tracker.ceph.com/issues/53180
Reported-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 8e55ba8c 10-Nov-2021 Kotresh HR <khiremat@redhat.com>

ceph: Fix incorrect statfs report for small quota

Problem:
The statfs reports incorrect free/available space for quota less then
CEPH_BLOCK size (4M).

Solution:
For quota less than CEPH_BLOCK size, smaller block size of 4K is used.
But if quota is less than 4K, it is decided to go with binary use/free
of 4K block. For quota size less than 4K size, report the
total=used=4K,free=0 when quota is full and total=free=4K,used=0
otherwise.

Signed-off-by: Kotresh HR <khiremat@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7b19b4db 14-Jul-2021 Venky Shankar <vshankar@redhat.com>

ceph: new device mount syntax

Old mount device syntax (source) has the following problems:

- mounts to the same cluster but with different fsnames
and/or creds have identical device string which can
confuse xfstests.

- Userspace mount helper tool resolves monitor addresses
and fill in mon addrs automatically, but that means the
device shown in /proc/mounts is different than what was
used for mounting.

New device syntax is as follows:

cephuser@fsid.mycephfs2=/path

Note, there is no "monitor address" in the device string.
That gets passed in as mount option. This keeps the device
string same when monitor addresses change (on remounts).

Also note that the userspace mount helper tool is backward
compatible. I.e., the mount helper will fallback to using
old syntax after trying to mount with the new syntax.

[ idryomov: drop CEPH_MON_ADDR_MNTOPT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 400e1286 07-Dec-2021 Jeff Layton <jlayton@kernel.org>

ceph: conversion to new fscache API

Now that the fscache API has been reworked and simplified, change ceph
over to use it.

With the old API, we would only instantiate a cookie when the file was
open for reads. Change it to instantiate the cookie when the inode is
instantiated and call use/unuse when the file is opened/closed.

Also, ensure we resize the cached data on truncates, and invalidate the
cache in response to the appropriate events. This will allow us to
plumb in write support later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4


# cbed4ff7 27-Oct-2021 Luís Henriques <lhenriques@suse.de>

ceph: split 'metric' debugfs file into several files

Currently, all the metrics are grouped together in a single file, making
it difficult to process this file from scripts. Furthermore, as new
metrics are added, processing this file will become even more challenging.

This patch turns the 'metric' file into a directory that will contain
several files, one for each metric.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 631ed4b0 14-Oct-2021 Jeff Layton <jlayton@kernel.org>

ceph: shut down mount on bad mdsmap or fsmap decode

As Greg pointed out, if we get a mangled mdsmap or fsmap, then something
has gone very wrong, and we should avoid doing any activity on the
filesystem.

When this occurs, shut down the mount the same way we would with a
forced umount by calling ceph_umount_begin when decoding fails on either
map. This causes most operations done against the filesystem to return
an error. Any dirty data or caps in the cache will be dropped as well.

The effect is not reversible, so the only remedy is to umount.

[ idryomov: print fsmap decoding error ]

URL: https://tracker.ceph.com/issues/52303
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5d6451b1 31-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: shut down access to inode when async create fails

Add proper error handling for when an async create fails. The inode
never existed, so any dirty caps or data are now toast. We already
d_drop the dentry in that case, but the now-stale inode may still be
around. We want to shut down access to these inodes, and ensure that
they can't harbor any more dirty data, which can cause problems at
umount time.

When this occurs, flag such inodes as being SHUTDOWN, and trash any caps
and cap flushes that may be in flight for them, and invalidate the
pagecache for the inode. Add a new helper that can check whether an
inode or an entire mount is now shut down, and call it instead of
accessing the mount_state directly in places where we test that now.

URL: https://tracker.ceph.com/issues/51279
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 36e6da98 02-Sep-2021 Jeff Layton <jlayton@kernel.org>

ceph: refactor remove_session_caps_cb

Move remove_capsnaps to caps.c. Move the part of remove_session_caps_cb
under i_ceph_lock into a separate function that lives in caps.c. Have
remove_session_caps_cb call the new helper after taking the lock.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f7a67b46 09-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: enable async dirops by default

Async dirops have been supported in mainline kernels for quite some time
now, and we've recently (as of June) started doing regular testing in
teuthology with '-o nowsync'. There were a few issues, but we've sorted
those out now.

Enable async dirops by default, and change /proc/mounts to show "wsync"
when they are disabled rather than "nowsync" when they are enabled.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1bd85aa6 07-Oct-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix handling of "meta" errors

Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a76d0a9c 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: don't WARN if we're forcibly removing the session caps

For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a6d37ccd 25-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: remove the capsnaps when removing caps

capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0ba92e1c 02-Aug-2021 Jeff Layton <jlayton@kernel.org>

ceph: add ceph_change_snap_realm() helper

Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0cad6246 18-Aug-2021 Miklos Szeredi <mszeredi@redhat.com>

vfs: add rcu argument to ->get_acl() callback

Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b2f9fa1f 18-Aug-2021 Xiubo Li <xiubli@redhat.com>

ceph: correctly handle releasing an embedded cap flush

The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bf2ba432 06-Jul-2021 Luis Henriques <lhenriques@suse.de>

ceph: reduce contention in ceph_check_delayed_caps()

Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list. This may result in the
watchdog tainting the kernel with the softlockup flag.

This patch breaks this loop if the caps have been recently (i.e. during
the loop execution). Any new caps added to the list will be handled in
the next run.

Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 23c2c76e 03-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: eliminate ceph_async_iput()

Now that we don't need to hold session->s_mutex or the snap_rwsem when
calling ceph_check_caps, we can eliminate ceph_async_iput and just use
normal iput calls.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 4364c693 13-May-2021 Jeff Layton <jlayton@kernel.org>

ceph: make ceph_queue_cap_snap static

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7a971e2c 01-Jun-2021 Jeff Layton <jlayton@kernel.org>

ceph: fix error handling in ceph_atomic_open and ceph_lookup

Commit aa60cfc3f7ee broke the error handling in these functions such
that they don't handle non-ENOENT errors from ceph_mdsc_do_request
properly.

Move the checking of -ENOENT out of ceph_handle_snapdir and into the
callers, and if we get a different error, return it immediately.

Fixes: aa60cfc3f7ee ("ceph: don't use d_add in ceph_handle_snapdir")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d4f6b31d 01-Apr-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't allow access to MDS-private inodes

The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.

Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.

URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e7f72952 27-Aug-2020 Yanhu Cao <gmayyyha@gmail.com>

ceph: support getting ceph.dir.rsnaps vxattr

Add support for grabbing the rsnaps value out of the inode info in
traces, and exposing that via ceph.dir.rsnaps xattr.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e72968e1 04-Apr-2021 Jeff Layton <jlayton@kernel.org>

ceph: drop pinned_page parameter from ceph_get_caps

All of the existing callers that don't set this to NULL just drop the
page reference at some arbitrary point later in processing. There's no
point in keeping a page reference that we don't use, so just drop the
reference immediately after checking the Uptodate flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aa60cfc3 01-Mar-2021 Jeff Layton <jlayton@kernel.org>

ceph: don't use d_add in ceph_handle_snapdir

It's possible ceph_get_snapdir could end up finding a (disconnected)
inode that already exists in the cache. Change the prototype for
ceph_handle_snapdir to return a dentry pointer and have it use
d_splice_alias so we don't end up with an aliased dentry in the cache.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7c46b318 21-Jan-2021 Jeff Layton <jlayton@kernel.org>

ceph: rework PageFsCache handling

With the new fscache API, the PageFsCache bit now indicates that the
page is being written to the cache and shouldn't be modified or released
until it's finished.

Change releasepage and invalidatepage to wait on that bit before
returning.

Also define FSCACHE_USE_NEW_IO_API so that we opt into the new fscache
API.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e7df4524 20-Jan-2021 Jeff Layton <jlayton@kernel.org>

ceph: rip out old fscache readpage handling

With the new netfs read helper functions, we won't need a lot of this
infrastructure as it handles the pagecache pages itself. Rip out the
read handling for now, and much of the old infrastructure that deals in
individual pages.

The cookie handling is mostly unchanged, however.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# a8810cdc 10-Dec-2020 Jeff Layton <jlayton@kernel.org>

ceph: allow queueing cap/snap handling after putting cap references

Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 64f28c62 09-Oct-2020 Jeff Layton <jlayton@kernel.org>

ceph: clean up inode work queueing

Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6646ea1c 12-Nov-2020 Luis Henriques <lhenriques@suse.de>

Revert "ceph: allow rename operation under different quota realms"

This reverts commit dffdcd71458e699e839f0bf47c3d42d64210b939.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly. Here's a testcase:

mkdir files limit
truncate files/file -s 10G
setfattr limit -n ceph.quota.max_bytes -v 1000000
mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 247b1f19 10-Nov-2020 Xiubo Li <xiubli@redhat.com>

ceph: add status debugfs file

This will help list some useful client side info, like the client
entity address/name and blocklisted status, etc.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dbeec07b 25-Sep-2020 Jeff Layton <jlayton@kernel.org>

ceph: remove timeout on allowing reconnect after blocklisting

30 minutes is a long time to wait, and this makes it difficult to test
the feature by manually blocklisting clients. Remove the timeout
infrastructure and just allow the client to reconnect at will.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aa5c7910 06-Oct-2020 Jeff Layton <jlayton@kernel.org>

ceph: make fsc->mount_state an int

This field is an unsigned long currently, which is a bit of a waste on
most arches since this just holds an enum. Make it (signed) int instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c74d79af 05-Oct-2020 Jeff Layton <jlayton@kernel.org>

ceph: comment cleanups and clarifications

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0b98acd6 14-Sep-2020 Ilya Dryomov <idryomov@gmail.com>

libceph, rbd, ceph: "blacklist" -> "blocklist"

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2678da88 03-Sep-2020 Xiubo Li <xiubli@redhat.com>

ceph: add ceph_sb_to_mdsc helper support to parse the mdsc

This will help simplify the code.

[ jlayton: fix minor merge conflict in quota.c ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ebce3eb2 18-Aug-2020 Jeff Layton <jlayton@kernel.org>

ceph: fix inode number handling on arches with 32-bit ino_t

Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).

I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).

What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.

Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.

The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.

While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.

NOTE: This is a user-visible change on 32-bit arches:

1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.

2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.

The workaround for both problems is to mount with "-o ino32".

[ idryomov: changelog tweak ]

URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a0102bda 30-Jul-2020 Jeff Layton <jlayton@kernel.org>

ceph: move sb->wb_pagevec_pool to be a global mempool

When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.

128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f1f565a2 17-Jul-2020 Randy Dunlap <rdunlap@infradead.org>

ceph: delete repeated words in fs/ceph/

Drop duplicated words "down" and "the" in fs/ceph/.

[ idryomov: merge into a single patch ]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 18f473b3 16-Jul-2020 Xiubo Li <xiubli@redhat.com>

ceph: periodically send perf metrics to MDSes

This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.

We can disable the metric sending via the disable_send_metrics module
parameter.

[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e64f44a8 27-May-2020 Xiubo Li <xiubli@redhat.com>

ceph: skip checking caps when session reconnecting and releasing reqs

It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.

URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# dffdcd71 07-Apr-2020 Luis Henriques <lhenriques@suse.com>

ceph: allow rename operation under different quota realms

Returning -EXDEV when trying to 'mv' files/directories from different
quota realms results in copy+unlink operations instead of the faster
CEPH_MDS_OP_RENAME. This will occur even when there aren't any quotas
set in the destination directory, or if there's enough space left for
the new file(s).

This patch adds a new helper function to be called on rename operations
which will allow these operations if they can be executed. This patch
mimics userland fuse client commit b8954e5734b3 ("client:
optimize rename operation under different quota root").

Since ceph_quota_is_same_realm() is now called only from this new
helper, make it static.

URL: https://tracker.ceph.com/issues/44791
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 829ad4db 03-Apr-2020 Jeff Layton <jlayton@kernel.org>

ceph: ceph_kick_flushing_caps needs the s_mutex

The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.

ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.

Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1cf03a68 01-Apr-2020 Jeff Layton <jlayton@kernel.org>

ceph: convert mdsc->cap_dirty to a per-session list

This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.

This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.

If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 4fb5dda3 19-Mar-2020 Jeff Layton <jlayton@kernel.org>

ceph: document what protects i_dirty_item and i_flushing_item

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1af16d54 19-Mar-2020 Xiubo Li <xiubli@redhat.com>

ceph: add caps perf metric for each superblock

Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f9009efa 19-Mar-2020 Xiubo Li <xiubli@redhat.com>

ceph: add dentry lease metric support

For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.

For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.

The output looks like:

item total miss hit
-------------------------------------------------
d_lease 11 7 141

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 135e671e 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: simplify calling of ceph_get_fmode()

Originally, calling ceph_get_fmode() for open files is by thread that
handles request reply. There is a small window between updating caps and
and waking the request initiator. We need to prevent ceph_check_caps()
from releasing wanted caps in the window.

Previous patches made fill_inode() call __ceph_touch_fmode() for open file
requests. This prevented ceph_check_caps() from releasing wanted caps for
'caps_wanted_delay_min' seconds, enough for request initiator to get
woken up and call ceph_get_fmode().

This allows us to now call ceph_get_fmode() in ceph_open() instead.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# a0d93e32 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: remove delay check logic from ceph_check_caps()

__ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and
'caps_wanted_delay_max'. There is no need to duplicate the logic in
ceph_check_caps() and __send_cap()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 719a2514 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: consider inode's last read/write when calculating wanted caps

Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
used to track the last time the client acquired read/write caps for
the inode.

If there is no read/write on an inode for 'caps_wanted_delay_max'
seconds, __ceph_caps_file_wanted() does not request caps for read/write
even there are open files.

Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
calculates dir's wanted caps according to last dir read/modification. If
there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If
there is recent dir modification, also wants CEPH_CAP_FILE_EXCL.

Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
readdir, as with that, modifications do not need to release
CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c0e385b1 05-Mar-2020 Yan, Zheng <zyan@redhat.com>

ceph: always renew caps if mds_wanted is insufficient

Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag,
which indicates that mds has closed the session and caps were dropped.
Remove this flag in preparation for not requesting caps for idle open
files.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 785892fe 02-Jan-2020 Jeff Layton <jlayton@kernel.org>

ceph: cache layout in parent dir on first sync create

If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 966c7160 05-Dec-2019 Jeff Layton <jlayton@kernel.org>

ceph: make ceph_fill_inode non-static

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2ccb4546 02-Apr-2019 Jeff Layton <jlayton@kernel.org>

ceph: perform asynchronous unlink if we have sufficient caps

The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this purpose.

When performing an unlink, if we have Fx on the parent directory,
and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
removed is the primary link, then then we can fire off an unlink
request immediately and don't need to wait on reply before returning.

In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.

If the operation ends up failing, then set a writeback error on the
directory inode, and the inode itself that can be fetched later by
an fsync on the dir.

The behavior of dir caps is slightly different from caps on normal
files. Because these are just considered an optimization, if the
session is reconnected, we will not automatically reclaim them. They
are instead considered lost until we do another synchronous op in the
parent directory.

Async dirops are enabled via the "nowsync" mount option, which is
patterned after the xfs "wsync" mount option. For now, the default
is "wsync", but eventually we may flip that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 40dcf75e 14-Jan-2020 Jeff Layton <jlayton@kernel.org>

ceph: make __take_cap_refs non-static

Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 891f3f5a 14-Jan-2020 Jeff Layton <jlayton@kernel.org>

ceph: add infrastructure for waiting for async create to complete

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f5e17aed 18-Feb-2020 Jeff Layton <jlayton@kernel.org>

ceph: track primary dentry link

Newer versions of the MDS will flag a dentry as "primary". In later
patches, we'll need to consult this info, so track it in di->flags.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e8a4d267 25-Feb-2020 Jeff Layton <jlayton@kernel.org>

ceph: clean up kick_flushing_inode_caps()

The last thing that this function does is release i_ceph_lock, so
have the caller do that instead. Add a lockdep assertion to
ensure that the function is always called with i_ceph_lock held.
Change the prototype to take a ceph_inode_info pointer and drop
the separate mdsc argument as we can get that from the session.

While at it, make it non-static. We'll need this to kick any
flushing caps once the create reply comes in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 525d15e8 11-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}

These bits will have new meaning for directory inodes.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f85122af 02-Apr-2019 Jeff Layton <jlayton@kernel.org>

ceph: add refcounting for Fx caps

In future patches we'll be taking and relying on Fx caps. Add proper
refcounting for them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b27a939e 10-Feb-2020 Ilya Dryomov <idryomov@gmail.com>

ceph: canonicalize server path in place

syzbot reported that 4fbc0c711b24 ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.

We don't really need the supplied server path, so canonicalize it
in place and compare it directly. To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.

Fixes: 4fbc0c711b24 ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# 78beb0ff 08-Jan-2020 Luis Henriques <lhenriques@suse.com>

ceph: use copy-from2 op in copy_file_range

Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.

If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP. In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ad8c28a9 09-Sep-2019 Jeff Layton <jlayton@kernel.org>

ceph: convert int fields in ceph_mount_options to unsigned int

Most of these values should never be negative, so convert them to
unsigned values. Add some sanity checking to the parsed values, and
clean up some unneeded casts.

Note that while caps_max should never be negative, this patch leaves
it signed, since this value ends up later being compared to a signed
counter. Just ensure that userland never passes in a negative value
for caps_max.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 82995cc6 25-Mar-2019 David Howells <dhowells@redhat.com>

libceph, rbd, ceph: convert to use the new mount API

Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 721d5c13 01-Aug-2019 Jeff Layton <jlayton@kernel.org>

ceph: make several helper accessors take const pointers

None of these helper functions change anything in memory, so we can
declare their arguments as const.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 668959a5 06-Aug-2019 Jeff Layton <jlayton@kernel.org>

ceph: turn ceph_security_invalidate_secctx into static inline

No need to do an extra jump here. Also add some comments on the endifs.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 321fe13c 02-Aug-2019 Jeff Layton <jlayton@kernel.org>

ceph: add buffered/direct exclusionary locking for reads and writes

xfstest generic/451 intermittently fails. The test does O_DIRECT writes
to a file, and then reads back the result using buffered I/O, while
running a separate set of tasks that are also doing buffered reads.

The client will invalidate the cache prior to a direct write, but it's
easy for one of the other readers' replies to race in and reinstantiate
the invalidated range with stale data.

To fix this, we must to serialize direct I/O writes and buffered reads.
We could just sprinkle in some shared locks on the i_rwsem for reads,
and increase the exclusive footprint on the write side, but that would
cause O_DIRECT writes to end up serialized vs. other direct requests.

Instead, borrow the scheme used by nfs.ko. Buffered writes take the
i_rwsem exclusively, but buffered reads take a shared lock, allowing
them to run in parallel.

O_DIRECT requests also take a shared lock, but we need for them to not
run in parallel with buffered reads. A flag on the ceph_inode_info is
used to indicate whether it's in direct or buffered I/O mode. When a
conflicting request is submitted, it will block until the inode can be
flipped to the necessary mode.

Link: https://tracker.ceph.com/issues/40985
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# daca8bda 05-Jul-2019 Jeff Layton <jlayton@kernel.org>

ceph: remove CEPH_I_NOFLUSH

Nothing sets this flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5de16b30 23-Jul-2019 Jeff Layton <jlayton@kernel.org>

ceph: remove ceph_get_cap_mds and __ceph_get_cap_mds

Nothing calls these routines.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 131d7eb4 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: auto reconnect after blacklisted

Make client use osd reply and session message to infer if itself is
blacklisted. Client reconnect to cluster using new entity addr if it
is blacklisted. Auto reconnect is limited to once every 30 minutes.

Auto reconnect is disabled by default. It can be enabled/disabled by
recover_session=<no|clean> mount option. In 'clean' mode, client drops
any dirty data/metadata, invalidates page caches and invalidates all
writable file handles. After reconnect, file locks become stale because
MDS loses track of them. If an inode contains any stale file locks,
read/write on the indoe are not allowed until applications release all
stale file locks.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 81f148a9 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: invalidate all write mode filp after reconnect

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ff5d913d 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: return -EIO if read/write against filp that lost file locks

After mds evicts session, file locks get lost sliently. It's not safe to
let programs continue to do read/write.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d468e729 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: add helper function that forcibly reconnects to ceph cluster.

It closes mds sessions, drop all caps and invalidates page caches,
then use new entity address to reconnect to the cluster.

After reconnect, all dirty data/metadata are dropped, file locks
get lost sliently. Open files continue to work because client will
try renewing caps on later read/write.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5e3ded1b 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: pass filp to ceph_get_caps()

Also change several other functions' arguments, no logical changes.
This is preparetion for later patch that checks filp error.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# f4b97866 25-Jul-2019 Yan, Zheng <zyan@redhat.com>

ceph: track and report error of async metadata operation

Use errseq_t to track and report errors of async metadata operations,
similar to how kernel handles errors during writeback.

If any dirty caps or any unsafe request gets dropped during session
eviction, record -EIO in corresponding inode's i_meta_err. The error
will be reported by subsequent fsync,

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 12fe3dda 19-Jul-2019 Luis Henriques <lhenriques@suse.com>

ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()

Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can
be fixed by having this function returning the old blob buffer and have
the callers of this function freeing it when the lock is released.

The following backtrace was triggered by fstests generic/117.

BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
4 locks held by fsstress/649:
#0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
#1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
#2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
#3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_build_xattrs_blob+0x12b/0x170
__send_cap+0x302/0x540
? __lock_acquire+0x23c/0x1e40
? __mark_caps_flushing+0x15c/0x280
? _raw_spin_unlock+0x24/0x30
ceph_check_caps+0x5f0/0xc60
ceph_flush_dirty_caps+0x7c/0x150
? __ia32_sys_fdatasync+0x20/0x20
ceph_sync_fs+0x5a/0x130
iterate_supers+0x8f/0xf0
ksys_sync+0x4f/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fc6409ab617

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 52dd0f1b 05-Jul-2019 Luis Henriques <lhenriques@suse.com>

ceph: use generic_delete_inode() for ->drop_inode

ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 87bc5b89 01-Jun-2019 Yan, Zheng <zyan@redhat.com>

ceph: use ceph_evict_inode to cleanup inode's resource

remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
freeing inode to remove its caps. But VFS wakes freeing inode waiters
before calling destroy_inode().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40102
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 176c77c9 06-Jun-2019 Jeff Layton <jlayton@kernel.org>

ceph: handle change_attr in cap messages

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ec62b894 28-May-2019 Jeff Layton <jlayton@kernel.org>

ceph: handle btime in cap messages

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 245ce991 29-May-2019 Jeff Layton <jlayton@kernel.org>

ceph: add btime field to ceph_inode_info

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ac6713cc 26-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: add selinux support

When creating new file/directory, use security_dentry_init_security() to
prepare selinux context for the new inode, then send openc/mkdir request
to MDS, together with selinux xattr.

security_dentry_init_security() only supports single security module and
only selinux has dentry_init_security hook. So only selinux is supported
for now. We can add support for other security modules once kernel has a
generic version of dentry_init_security()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5c31e92d 26-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: rename struct ceph_acls_info to ceph_acl_sec_ctx

Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx().
And move their definitions to different files. This is preparation
for security label support.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d6e47819 22-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: hold i_ceph_lock when removing caps for freeing inode

ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 74960773 20-May-2019 Andrea Parri <andrea.parri@amarulasolutions.com>

ceph: fix improper use of smp_mb__before_atomic()

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic64_set() primitive.

Replace the barrier with an smp_mb().

Fixes: fdd4e15838e59 ("ceph: rework dcache readdir")
Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d0f191d2 18-Apr-2019 David Disseldorp <ddiss@suse.de>

ceph: remove unused vxattr length helpers

ceph_listxattr() now calculates the length of vxattrs dynamically, so
these helpers, which incorrectly ignore vxattr.exists_cb(), can be
removed.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 193e7b37 18-Apr-2019 David Disseldorp <ddiss@suse.de>

ceph: carry snapshot creation time with inodes

MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1a829ff2 12-Jun-2019 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ceph: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

This cleanup allows the return value of the functions to be made void,
as no logic should care if these files succeed or not.

Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612145538.GA18772@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 3e1d0452 18-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: avoid iput_final() while holding mutex or in dispatch thread

iput_final() may wait for reahahead pages. The wait can cause deadlock.
For example:

Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
schedule+0x36/0x80
io_schedule+0x16/0x40
__lock_page+0x101/0x140
truncate_inode_pages_range+0x556/0x9f0
truncate_inode_pages_final+0x4d/0x60
evict+0x182/0x1a0
iput+0x1d2/0x220
iterate_session_caps+0x82/0x230 [ceph]
dispatch+0x678/0xa80 [ceph]
ceph_con_workfn+0x95b/0x1560 [libceph]
process_one_work+0x14d/0x410
worker_thread+0x4b/0x460
kthread+0x105/0x140
ret_from_fork+0x22/0x40

Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
__schedule+0x3d6/0x8b0
schedule+0x36/0x80
schedule_preempt_disabled+0xe/0x10
mutex_lock+0x2f/0x40
ceph_check_caps+0x505/0xa80 [ceph]
ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
writepages_finish+0x2d3/0x410 [ceph]
__complete_request+0x26/0x60 [libceph]
handle_reply+0x6c8/0xa10 [libceph]
dispatch+0x29a/0xbb0 [libceph]
ceph_con_workfn+0x95b/0x1560 [libceph]
process_one_work+0x14d/0x410
worker_thread+0x4b/0x460
kthread+0x105/0x140
ret_from_fork+0x22/0x40

In above example, truncate_inode_pages_range() waits for readahead pages
while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks
OSD dispatch thread. Later OSD replies (for readahead) can't be handled.

ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock
can happen if iput_final() is called while holding snap_rwsem.

In general, it's not good to call iput_final() inside MDS/OSD dispatch
threads or while holding any mutex.

The fix is introducing ceph_async_iput(), which calls iput_final() in
workqueue.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1cf89a8d 17-May-2019 Yan, Zheng <zyan@redhat.com>

ceph: single workqueue for inode related works

We have three workqueue for inode works. Later patch will introduce
one more work for inode. It's not good to introcuce more workqueue
and add more 'struct work_struct' to 'struct ceph_inode_info'.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 0c44a8e0 21-Mar-2019 Luis Henriques <lhenriques@suse.com>

ceph: quota: fix quota subdir mounts

The CephFS kernel client does not enforce quotas set in a directory that
isn't visible from the mount point. For example, given the path
'/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with

mount -t ceph <server>:<port>:/dir1/ /mnt

then the client won't be able to access 'dir1' inode, even if 'dir2' belongs
to a quota realm that points to it.

This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
unknown inodes. Any inode reference obtained this way will be added to a
list in ceph_mds_client, and will only be released when the filesystem is
umounted.

Link: https://tracker.ceph.com/issues/38482
Reported-by: Hendrik Peyerl <hpeyerl@plusline.net>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 3886274a 21-Mar-2019 Luis Henriques <lhenriques@suse.com>

ceph: factor out ceph_lookup_inode()

This function will be used by __fh_to_dentry and by the quotas code, to
find quota realm inodes that are not visible in the mountpoint.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cfa6d412 10-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

ceph: use ->free_inode()

a lot of non-delayed work in this case; all of that is left in
->destroy_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fe33032d 31-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: add mount option to limit caps count

If number of caps exceed the limit, ceph_trim_dentires() also trim
dentries with valid leases. Trimming dentry releases references to
associated inode, which may evict inode and release caps.

By default, there is no limit for caps count.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 37c4efc1 31-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: periodically trim stale dentries

Previous commit make VFS delete stale dentry when last reference is
dropped. Lease also can become invalid when corresponding dentry has
no reference. This patch make cephfs periodically scan lease list,
delete corresponding dentry if lease is invalid.

There are two types of lease, dentry lease and dir lease. dentry lease
has life time and applies to singe dentry. Dentry lease is added to tail
of a list when it's updated, leases at front of the list will expire
first. Dir lease is CEPH_CAP_FILE_SHARED on directory inode, it applies
to all dentries in the directory. Dentries have dir leases are added to
another list. Dentries in the list are periodically checked in a round
robin manner.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1e9c2eb6 28-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: delete stale dentry when last reference is dropped

introduce ceph_d_delete(), which checks if dentry has valid lease.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e3ec8d68 14-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: send cap releases more aggressively

When pending cap releases fill up one message, start a work to send
cap release message. (old way is sending cap releases every 5 seconds)

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 08796873 08-Jan-2019 Yan, Zheng <zyan@redhat.com>

ceph: support getting ceph.dir.pin vxattr

Link: http://tracker.ceph.com/issues/37576
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 75c9627e 14-Dec-2017 Yan, Zheng <zyan@redhat.com>

ceph: map snapid to anonymous bdev ID

ceph_getattr() return zero dev ID for head inodes and set dev ID to
snapid directly for snaphost inodes. This is not good because userspace
utilities may consider device ID of 0 as invalid, snapid may conflict
with other device's ID.

This patch introduces "snapids to anonymous bdev IDs" map. we create a
new mapping when we see a snapid for the first time. we trim unused
mapping after it is ilde for 5 minutes.

Link: http://tracker.ceph.com/issues/22353
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ca79b0c2 28-Dec-2018 Arun KS <arunks@codeaurora.org>

mm: convert totalram_pages and totalhigh_pages variables to atomic

totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6f9718fe 10-Dec-2018 Luis Henriques <lhenriques@suse.com>

ceph: make 'nocopyfrom' a default mount option

Since we found a problem with the 'copy-from' operation after objects have
been truncated, offloading object copies to OSDs should be discouraged
until the issue is fixed.

Thus, this patch adds the 'nocopyfrom' mount option to the default mount
options which effectily means that remote copies won't be done in
copy_file_range unless they are explicitly enabled at mount time.

[ Adjust ceph_show_options() accordingly. ]

Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# ea4cdc54 15-Oct-2018 Luis Henriques <lhenriques@suse.com>

ceph: new mount option to disable usage of copy-from op

Add a new mount option 'nocopyfrom' that will prevent the usage of the
RADOS 'copy-from' operation in cephfs. This could be useful, for example,
for an administrator to temporarily mitigate any possible bugs in the
'copy-from' implementation.

Currently, only copy_file_range uses this RADOS operation. Setting this
mount option will result in this syscall reverting to the default VFS
implementation, i.e. to perform the copies locally instead of doing remote
object copies.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2ee9dd95 15-Oct-2018 Luis Henriques <lhenriques@suse.com>

ceph: add non-blocking parameter to ceph_try_get_caps()

ceph_try_get_caps currently calls try_get_cap_refs with the nonblock
parameter always set to 'true'. This change adds a new parameter that
allows to set it's value. This will be useful for a follow-up patch that
will need to get two sets of capabilities for two different inodes without
risking a deadlock.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 7bf8f736 28-Jul-2018 Chengguang Xu <cgxu519@gmx.com>

ceph: refactor ceph_unreserve_caps()

The code of ceph_unreserve_caps() and error handling in
ceph_reserve_caps() are duplicated, so introduce a helper
__ceph_unreserve_caps() to reduce duplicated code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 719784ba 19-Jul-2018 Chengguang Xu <cgxu519@gmx.com>

ceph: add new field max_file_size in ceph_fs_client

In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 9bbeab41 13-Jul-2018 Arnd Bergmann <arnd@arndb.de>

ceph: use timespec64 for inode timestamp

Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.

In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 44907d79 08-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

get rid of 'opened' argument of ->atomic_open() - part 3

now it can be done...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9122eed5 31-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: report root dir quota usage in statfs

This commit changes statfs default behaviour when reporting usage
statistics. Instead of using the overall filesystem usage, statfs now
reports the quota for the filesystem root, if ceph.quota.max_bytes has
been set for this inode. If quota hasn't been set, it falls back to the
old statfs behaviour.

A new mount option is also added ('noquotadf') to disable this behaviour.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# d557c48d 12-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: add counter for snaprealms with quota

By keeping a counter with the number of snaprealms that have quota set
allows to optimize the functions that need to walk throught the realms
hierarchy looking for quotas. Thus, if this counter is zero it's safe to
assume that there are no realms with quota.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e3161f17 12-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: cache inode pointer in ceph_snap_realm

Keep a pointer to the inode in struct ceph_snap_realm. This allows to
optimize functions that walk the realms hierarchy (e.g. in quotas).

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1ab302a0 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: update MDS when max_bytes is approaching

When we're reaching the ceph.quota.max_bytes limit, i.e., when writing
more than 1/16th of the space left in a quota realm, update the MDS with
the new file size.

This mirrors the fuse-client approach with commit 122c50315ed1 ("client:
Inform mds file size when approaching quota limit"), in the ceph git tree.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2b83845f 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: support for ceph.quota.max_bytes

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# cafe21a4 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: don't allow cross-quota renames

This patch changes ceph_rename so that -EXDEV is returned if an attempt is
made to mv a file between two different dir trees with different quotas
setup.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b7a29217 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: support for ceph.quota.max_files

This patch adds support for the max_files quota. It hooks into all the
ceph functions that add new filesystem objects that need to be checked
against the quota limits. When these limits are hit, -EDQUOT is returned.

Note that we're not checking quotas on ceph_link(). ceph_link doesn't
really create a new inode, and since the MDS doesn't update the directory
statistics when a new (hard) link is created (only with symlinks), they
are not accounted as a new file.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fb18a575 05-Jan-2018 Luis Henriques <lhenriques@suse.com>

ceph: quota: add initial infrastructure to support cephfs quotas

This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client. Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# bb48bd4d 12-Mar-2018 Chengguang Xu <cgxu519@gmx.com>

ceph: optimize memory usage

In current code, regular file and directory use same struct
ceph_file_info to store fs specific data so the struct has to
include some fields which are only used for directory
(e.g., readdir related info), when having plenty of regular files,
it will lead to memory waste.

This patch introduces dedicated ceph_dir_file_info cache for
readdir related thins. So that regular file does not include those
unused fields anymore.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 6ef0bc6d 24-Jan-2018 Zhi Zhang <zhang.david2011@gmail.com>

ceph: flush dirty caps of unlinked inode ASAP

Client should release unlinked inode from its cache ASAP. But client
can't release inode with dirty caps.

Link: http://tracker.ceph.com/issues/22886
Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# e30ee581 24-Jan-2018 Zhi Zhang <zhang.david2011@gmail.com>

ceph: try to allocate enough memory for reserved caps

ceph_reserve_caps() may not reserve enough caps under high memory
pressure, but it saved the needed caps number that expected to
be reserved. When getting caps, crash would happen due to number
mismatch.

Now we will try to trim more caps when failing to allocate memory
for caps need to be reserved, then try again. If still failing to
allocate memory, return -ENOMEM.

Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5d988308 14-Dec-2017 Yan, Zheng <zyan@redhat.com>

ceph: track read contexts in ceph_file_info

Previously ceph_read_iter() uses current->journal to pass context info
to ceph_readpages(), so that ceph_readpages() can distinguish read(2)
from readahead(2)/fadvise(2)/madvise(2). The problem is that page fault
can happen when copying data to userspace memory. Page fault may call
other filesystem's page_mkwrite() if the userspace memory is mapped to a
file. The later filesystem may also want to use current->journal.

The fix is define a on-stack data structure in ceph_read_iter(), add it
to context list in ceph_file_info. ceph_readpages() searches the list,
find if there is a context belongs to current thread.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 97aeb6bf 26-Nov-2017 Yan, Zheng <zyan@redhat.com>

ceph: use atomic_t for ceph_inode_info::i_shared_gen

It allows accessing i_shared_gen without holding i_ceph_lock. It is
preparation for later patch.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b3f8d68f 10-Sep-2017 Yan, Zheng <zyan@redhat.com>

ceph: handle 'session get evicted while there are file locks'

When session get evicted, all file locks associated with the session
get released remotely by mds. File locks tracked by kernel become
stale. In this situation, set an error flag on inode. The flag makes
further file locks return -EIO.

Another option to handle this situation is cleanup file locks tracked
kernel. I do not choose it because it is inconvenient to notify user
program about the error.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c6db8472 10-Sep-2017 Yan, Zheng <zyan@redhat.com>

ceph: make lock_to_ceph_filelock() static

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 89aa5930 08-Sep-2017 Yan, Zheng <zyan@redhat.com>

ceph: keep auth cap when inode has flocks or posix locks

file locks are tracked by inode's auth mds. dropping auth caps
is equivalent to releasing all file locks.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 4214fb15 11-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: validate correctness of some mount options

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 95cca2b4 11-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: limit osd write size

OSD has a configurable limitation of max write size. OSD return
error if write request size is larger than the limitation. For now,
set max write size to CEPH_MSG_MAX_DATA_LEN. It should be small
enough.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# aa187926 11-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: limit osd read size to CEPH_MSG_MAX_DATA_LEN

libceph returns -EIO when read size > CEPH_MSG_MAX_DATA_LEN.

Link: http://tracker.ceph.com/issues/20528
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2ae409dc 11-Jul-2017 Yan, Zheng <zyan@redhat.com>

ceph: remove unused cap_release_safety mount option

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 1d8f8360 26-Jun-2017 Yan, Zheng <zyan@redhat.com>

ceph: new mount option that specifies fscache uniquifier

Current ceph uses FSID as primary index key of fscache data. This
allows ceph to retain cached data across remount. But this causes
problem (kernel opps, fscache does not support sharing data) when
a filesystem get mounted several times (with fscache enabled, with
different mount options).

The fix is adding a new mount option, which specifies uniquifier
for fscache.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# efb0ca76 21-May-2017 Yan, Zheng <zyan@redhat.com>

ceph: update the 'approaching max_size' code

The old 'approaching max_size' code expects MDS set max_size to
'2 * reported_size'. This is no longer true. The new code reports
file size when half of previous max_size increment has been used.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 26544c62 04-Apr-2017 Jeff Layton <jlayton@kernel.org>

ceph: when seeing write errors on an inode, switch to sync writes

Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 805692d0 03-Mar-2017 Elena Reshetova <elena.reshetova@intel.com>

ceph: convert ceph_cap_snap.nref from atomic_t to refcount_t

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 09dc9fc2 11-Apr-2017 Jan Kara <jack@suse.cz>

ceph: Convert to separately allocated bdi

Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: "Yan, Zheng" <zyan@redhat.com>
CC: Sage Weil <sage@redhat.com>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# a528d35e 31-Jan-2017 David Howells <dhowells@redhat.com>

statx: Add a system call to make enhanced file info available

Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included. The
following have been included:

(1) Make the fields a consistent size on all arches and make them large.

(2) Spare space, request flags and information flags are provided for
future expansion.

(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).

(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).

This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].

(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).

(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.

(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).

(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].

(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].

(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...

(This requires a separate system call - I have an fsinfo() call idea
for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].

(Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].

(Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).

(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.

(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.

(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data. This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};

struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};

The defined bits in request_mask and stx_mask are:

STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:

STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

STATX_ATTR_AUTOMOUNT Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

(0) stx_dev_*, stx_blksize.

These are local system information and are always available.

(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.

If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.

Note that there are instances where the type might not be valid, for
instance Windows reparse points.

(2) stx_rdev_*.

This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.

(3) stx_btime.

Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 55f2a045 13-Feb-2017 Ilya Dryomov <idryomov@gmail.com>

ceph: remove special ack vs commit behavior

- ask for a commit reply instead of an ack reply in
__ceph_pool_perm_get()
- don't ask for both ack and commit replies in ceph_sync_write()
- since just only one reply is requested now, i_unsafe_writes list
will always be empty -- kill ceph_sync_write_wait() and go back to
a standard ->evict_inode()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# f5a03b08 31-Jan-2017 Jeff Layton <jlayton@kernel.org>

ceph: drop session argument to ceph_fill_trace

Just get it from r_session since that's what's always passed in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# c1944fed 29-Jan-2017 Yan, Zheng <zyan@redhat.com>

ceph: avoid calling ceph_renew_caps() infinitely

__ceph_caps_mds_wanted() ignores caps from stale session. So the
return value of __ceph_caps_mds_wanted() can keep the same across
ceph_renew_caps(). This causes try_get_cap_refs() to keep calling
ceph_renew_caps(). The fix is ignore the session valid check for
the try_get_cap_refs() case. If session is stale, just let the
caps requester sleep.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 7c94ba27 10-Jan-2017 Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc>

ceph: set io_pages bdi hint

This patch sets the io_pages bdi hint based on the rsize mount option.
Without this patch large buffered reads (request size > max readahead)
are processed sequentially in chunks of the readahead size (i.e. read
requests are sent out up to the readahead size, then the
do_generic_file_read() function waits until the first page is received).

With this patch read requests are sent out at once up to the size
specified in the rsize mount option (default: 64 MB).

Signed-off-by: Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>


# ca6c8ae0 15-Dec-2016 Jeff Layton <jlayton@kernel.org>

ceph: pass parent inode info to ceph_encode_dentry_release if we have it

If we have a parent inode reference already, then we don't need to
go back up the directory tree to find one.

Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 5f743e45 15-Nov-2016 Yan, Zheng <zyan@redhat.com>

ceph: record truncate size/seq for snap data writeback

Dirty snapshot data needs to be flushed unconditionally. If they
were created before truncation, writeback should use old truncate
size/seq.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e9e427f0 10-Nov-2016 Yan, Zheng <zyan@redhat.com>

ceph: check availability of mds cluster on mount

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 2b1ac852 24-Oct-2016 Yan, Zheng <zyan@redhat.com>

ceph: try getting buffer capability for readahead/fadvise

For readahead/fadvise cases, caller of ceph_readpages does not
hold buffer capability. Pages can be added to page cache while
there is no buffer capability. This can cause data integrity
issue.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# ad5cb123 28-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph: switch to use of ->d_init()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18fc8abd 28-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

ceph: unify dentry_operations instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c8799fc4 07-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: optimize cap flush waiting

Add a 'wake' flag to ceph_cap_flush struct, which indicates if there
is someone waiting for it to finish. When getting flush ack message,
we check the 'wake' flag in corresponding ceph_cap_flush struct to
decide if we should wake up waiters. One corner case is that the
acked cap flush has 'wake' flags is set, but it is not the first one
on the flushing list. We do not wake up waiters in this case, set
'wake' flags of preceding ceph_cap_flush struct instead

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# ed9b430c 05-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: cleanup ceph_flush_snaps()

This patch devide __ceph_flush_snaps() into two stags. In the first
stage, __ceph_flush_snaps() assign snapcaps flush TIDs and add them
to cap flush lists. __ceph_flush_snaps() keeps holding the
i_ceph_lock in this stagge. So inode's auth cap can not change. In
the second stage, __ceph_flush_snaps() send flushsnap cap messages.
i_ceph_lock is unlocked before sending each cap message. If auth cap
changes in the middle, __ceph_flush_snaps() just stops. This is OK
because kick_flushing_inode_caps() will re-send flushsnap cap messages
to inode's new auth MDS.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 70220ac8 06-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: introduce an inode flag to indicates if snapflush is needed

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 13c2b57d 05-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: avoid sending duplicated cap flush message

make ceph_kick_flushing_caps() ignore inodes whose cap flushes
have already been re-sent by ceph_early_kick_flushing_caps()

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 0e294387 04-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: unify cap flush and snapcap flush

This patch includes following changes
- Assign flush tid to snapcap flush
- Remove session's s_cap_snaps_flushing list. Add inode to session's
s_cap_flushing list instead. Inode is removed from the list when
there is no pending snapcap flush or cap flush.
- make __kick_flushing_caps() re-send both snapcap flushes and cap
flushes.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e4500b5e 05-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: use list instead of rbtree to track cap flushes

We don't have requirement of searching cap flush by TID. In most cases,
we just need to know TID of the oldest cap flush. List is ideal for this
usage.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 430afbad 07-Jul-2016 Yan, Zheng <zyan@redhat.com>

ceph: mount non-default filesystem by name

To mount non-default filesytem, user currently needs to provide mds
namespace ID. This is inconvenience.

This patch makes user be able to mount filesystem by name. If user
wants to mount non-default filesystem. Client first subscribes to
fsmap.user. Subscribe to mdsmap.<ID> after getting ID of filesystem.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 9b16f03c 22-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

ceph: don't use ->d_time

Pretty simple: just use ceph_dentry_info.time instead (which was already
there, unused).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 9a5530c6 15-Jun-2016 Yan, Zheng <zyan@redhat.com>

ceph: wait unsafe sync writes for evicting inode

Otherwise ceph_sync_write_unsafe() may access/modify freed inode.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 774a6a11 06-Jun-2016 Yan, Zheng <zyan@redhat.com>

ceph: reduce i_nr_by_mode array size

Track usage count for individual fmode bit. This can reduce the
array size by half.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 779fe0fb 06-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: rados pool namespace support

This patch adds codes that decode pool namespace information in
cap message and request reply. Pool namespace is saved in i_layout,
it will be passed to libceph when doing read/write.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f7f7e7a0 18-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: improve fscache revalidation

There are several issues in fscache revalidation code.
- In ceph_revalidate_work(), fscache_invalidate() is called when
fscache_check_consistency() return 0. This is complete wrong
because 0 means cache is valid.
- Handle_cap_grant() calls ceph_queue_revalidate() if client
already has CAP_FILE_CACHE. This code is confusing. Client
should revalidate the cache each time it got CAP_FILE_CACHE
anew.
- In Handle_cap_grant(), fscache_invalidate() is called if MDS
revokes CAP_FILE_CACHE. This is inconsistency with the case
that inode get evicted. In the later case, the cache is not
discarded. Client may use the cache when inode is reloaded.

This patch moves the fscache revalidation into ceph_get_caps().
Client revalidates the cache after it gets CAP_FILE_CACHE.
i_rdcache_gen should keep constance while CAP_FILE_CACHE is
used. If i_fscache_gen is not equal to i_rdcache_gen, client
needs to check cache's consistency.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 1b1bc16d 03-May-2016 Yan, Zheng <zyan@redhat.com>

ceph: improve fragtree change detection

check if number of splits in i_fragtree is equal to number of splits
in mds reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f3c4ebe6 28-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: using hash value to compose dentry offset

If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:

(0xff << 52) | ((24 bits hash) << 28) |
(the nth entry hash hash collision)

This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8974eebd 28-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: record 'offset' for each entry of readdir result

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 3f384954 20-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: report mount root in session metadata

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 77310320 08-Apr-2016 Yan, Zheng <zyan@redhat.com>

ceph: renew caps for read/write if mds session got killed.

When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 235a0982 30-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: multiple filesystem support

To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 2cdeb1e4 13-Apr-2016 Andreas Gruenbacher <agruenba@redhat.com>

ceph: Switch to generic xattr handlers

Add a catch-all xattr handler at the end of ceph_xattr_handlers. Check
for valid attribute names there, and remove those checks from
__ceph_{get,set,remove}xattr instead. No "system.*" xattrs need to be
handled by the catch-all handler anymore.

The set xattr handler is called with a NULL value to indicate that the
attribute should be removed; __ceph_setxattr already handles that case
correctly (ceph_set_acl could already calling __ceph_setxattr with a NULL
value).

Move the check for snapshots from ceph_{set,remove}xattr into
__ceph_{set,remove}xattr. With that, ceph_{get,set,remove}xattr can be
replaced with the generic iops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a26fecca 13-Apr-2016 Andreas Gruenbacher <agruenba@redhat.com>

ceph: Get rid of d_find_alias in ceph_set_acl

Create a variant of ceph_setattr that takes an inode instead of a
dentry. Change __ceph_setxattr (and also __ceph_removexattr) to take an
inode instead of a dentry. Use those in ceph_set_acl so that we no
longer need a dentry there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ce23e640 10-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

->getxattr(): pass dentry and inode as separate arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 641235d8 16-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: kill ceph_get_dentry_parent_inode()

use vfs helper dget_parent() instead

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 315f2408 06-Mar-2016 Yan, Zheng <zyan@redhat.com>

ceph: fix security xattr deadlock

When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.

The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 34b759b4 16-Feb-2016 Ilya Dryomov <idryomov@gmail.com>

ceph: kill ceph_empty_snapc

ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to
ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is
equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only
for sizing the request message.

Further, in all four cases the subsequent ceph_osdc_build_request() is
passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps
and making ceph_empty_snapc entirely useless. The two cases where it
actually mattered were removed in commits 860560904962 ("ceph: avoid
sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix
queuing inode to mdsdir's snaprealm").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>


# 133e9156 24-Jan-2016 Yan, Zheng <zyan@redhat.com>

ceph: don't enable rbytes mount option by default

When rbytes mount option is enabled, directory size is recursive
size. Recursive size is not updated instantly. This can cause
directory size to change between successive stat(1)

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 5ea5c5e0 14-Feb-2016 Yan, Zheng <zyan@redhat.com>

ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support

Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 68cd5b4b 27-Oct-2015 Yan, Zheng <zyan@redhat.com>

ceph: make fsync() wait unsafe requests that created/modified inode

If we get a unsafe reply for request that created/modified inode,
add the unsafe request to a list in the newly created/modified
inode. So we can make fsync() wait these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# fc927cd3 19-Jul-2015 Yan, Zheng <zyan@redhat.com>

ceph: always re-send cap flushes when MDS recovers

commit e548e9b93d3e565e42b938a99804114565be1f81 makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.

This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# fdd4e158 16-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: rework dcache readdir

Previously our dcache readdir code relies on that child dentries in
directory dentry's d_subdir list are sorted by dentry's offset in
descending order. When adding dentries to the dcache, if a dentry
already exists, our readdir code moves it to head of directory
dentry's d_subdir list. This design relies on dcache internals.
Al Viro suggests using ncpfs's approach: keeping array of pointers
to dentries in page cache of directory inode. the validity of those
pointers are presented by directory inode's complete and ordered
flags. When a dentry gets pruned, we clear directory inode's complete
flag in the d_prune() callback. Before moving a dentry to other
directory, we clear the ordered flag for both old and new directory.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# f66fd9f0 10-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: pre-allocate data structure that tracks caps flushing

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e548e9b9 10-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: re-send flushing caps (which are revoked) in reconnect stage

if flushing caps were revoked, we should re-send the cap flush in
client reconnect stage. This guarantees that MDS processes the cap
flush message before issuing the flushing caps to other client.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 8310b089 09-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: track pending caps flushing globally

So we know TID of the oldest pending caps flushing. Later patch will
send this information to MDS, so that MDS can trim its completed caps
flush list.

Tracking pending caps flushing globally also simplifies syncfs code.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 553adfd9 09-Jun-2015 Yan, Zheng <zyan@redhat.com>

ceph: track pending caps flushing accurately

Previously we do not trace accurate TID for flushing caps. when
MDS failovers, we have no choice but to re-send all flushing caps
with a new TID. This can cause problem because MDS can has already
flushed some caps and has issued the same caps to other client.
The re-sent cap flush has a new TID, which makes MDS unable to
detect if it has already processed the cap flush.

This patch adds code to track pending caps flushing accurately.
When re-sending cap flush is needed, we use its original flush
TID.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 745a8e3b 14-May-2015 Yan, Zheng <zyan@redhat.com>

ceph: don't pre-allocate space for cap release messages

Previously we pre-allocate cap release messages for each caps. This
wastes lots of memory when there are large amount of caps. This patch
make the code not pre-allocate the cap release messages. Instead,
we add the corresponding ceph_cap struct to a list when releasing a
cap. Later when flush cap releases is needed, we allocate the cap
release messages dynamically.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 86056090 01-May-2015 Yan, Zheng <zyan@redhat.com>

ceph: avoid sending unnessesary FLUSHSNAP message

when a snap notification contains no new snapshot, we can avoid
sending FLUSHSNAP message to MDS. But we still need to create
cap_snap in some case because it's required by write path and
page writeback path

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 7b06a826 30-Apr-2015 Yan, Zheng <zyan@redhat.com>

ceph: use empty snap context for uninline_data and get_pool_perm

Cached_context in ceph_snap_realm is directly accessed by
uninline_data() and get_pool_perm(). This is racy in theory.
both uninline_data() and get_pool_perm() do not modify existing
object, they only create new object. So we can pass the empty
snap context to them. Unlike cached_context in ceph_snap_realm,
we do not need to protect the empty snap context.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 10183a69 27-Apr-2015 Yan, Zheng <zyan@redhat.com>

ceph: check OSD caps before read/write

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e1eba3ea 03-Mar-2015 Fabian Frederick <fabf@skynet.be>

ceph: remove redundant declaration

ceph_aops was already defined extern in addr.c section

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e2c3de04 04-Mar-2015 Yan, Zheng <zyan@redhat.com>

ceph: fix dcache/nocache mount option

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 38c48b5f 13-Jan-2015 Yan, Zheng <zyan@redhat.com>

ceph: provide seperate {inode,file}_operations for snapdir

remove all unsupported operations from {inode,file}_operations.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 982d6011 23-Dec-2014 Yan, Zheng <zyan@redhat.com>

ceph: improve reference tracking for snaprealm

When snaprealm is created, its initial reference count is zero.
But in some rare cases, the newly created snaprealm is not referenced
by anyone. This causes snaprealm with zero reference count not freed.

The fix is set reference count of newly snaprealm to 1. The reference
is return the function who requests to create the snaprealm. When the
function finishes its job, it releases the reference.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# e20d258d 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: flush inline version

After converting inline data to normal data, client need to flush
the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes
cap messages (sent to MDS) contain inline_version and inline_data.
Client always converts inline data to normal data before data write,
so the inline data length part is always zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 28127bdd 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: convert inline data to normal data before data write

Before any data write, convert inline data to normal data and set
i_inline_version to CEPH_INLINE_NONE. The OSD request that saves
inline data to object contains 3 operations (CMPXATTR, WRITE and
SETXATTR). It compares a xattr named 'inline_version' to prevent
old data overwrites newer data.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 01deead0 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: use getattr request to fetch inline data

Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data
in getattr reply will be copied to the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 31c542a1 14-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: add inline data to pagecache

Request reply and cap message can contain inline data. add inline data
to the page cache if there is Fc cap.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 97c85a82 06-Nov-2014 Yan, Zheng <zyan@redhat.com>

ceph: introduce global empty snap context

Current snaphost code does not properly handle moving inode from one
empty snap realm to another empty snap realm. After changing inode's
snap realm, some dirty pages' snap context can be not equal to inode's
i_head_snap. This can trigger BUG() in ceph_put_wrbuffer_cap_refs()

The fix is introduce a global empty snap context for all empty snap
realm. This avoids triggering the BUG() for filesystem with no snapshot.

Fixes: http://tracker.ceph.com/issues/9928

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>


# 70db4f36 21-Oct-2014 Yan, Zheng <zyan@redhat.com>

ceph: introduce a new inode flag indicating if cached dentries are ordered

After creating/deleting/renaming file, offsets of sibling dentries may
change. So we can not use cached dentries to satisfy readdir. But we can
still use the cached dentries to conclude -ENOENT for lookup.

This patch introduces a new inode flag indicating if child dentries are
ordered. The flag is set at the same time marking a directory complete.
After creating/deleting/renaming file, we clear the flag on directory
inode. This prevents ceph_readdir() from using cached dentries to satisfy
readdir syscall.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# 14ed9703 12-Sep-2014 John Spray <john.spray@redhat.com>

ceph: additional debugfs output

MDS session state and client global ID is
useful instrumentation when testing.

Signed-off-by: John Spray <john.spray@redhat.com>


# b1ee94aa 16-Sep-2014 Yan, Zheng <zyan@redhat.com>

ceph: include the initial ACL in create/mkdir/mknod MDS requests

Current code set new file/directory's initial ACL in a non-atomic
manner.
Client first sends request to MDS to create new file/directory, then set
the initial ACL after the new file/directory is successfully created.

The fix is include the initial ACL in create/mkdir/mknod MDS requests.
So MDS can handle creating file/directory and setting the initial ACL in
one request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>


# 508b32d8 16-Sep-2014 Yan, Zheng <zyan@redhat.com>

ceph: request xattrs if xattr_version is zero

Following sequence of events can happen.
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply
doesn't contain xattrs because MDS didn't receive the cap release
message and thought client already has up-to-data xattrs.

The fix is force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>


# d9df2783 17-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: pre-allocate ceph_cap struct for ceph_add_cap()

So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handle cap import/export.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 0a8a70f9 13-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: clear directory's completeness when creating file

When creating a file, ceph_set_dentry_offset() puts the new dentry
at the end of directory's d_subdirs, then set the dentry's offset
based on directory's max offset. The offset does not reflect the
real postion of the dentry in directory. Later readdir reply from
MDS may change the dentry's position/offset. This inconsistency
can cause missing/duplicate entries in readdir result if readdir
is partly satisfied by dcache_readdir().

The fix is clear directory's completeness after creating/renaming
file. It prevents later readdir from using dcache_readdir().

Fixes: http://tracker.ceph.com/issues/8025
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# eb13e832 09-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use fl->fl_file as owner identifier of flock and posix lock

flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# f0494206 27-Feb-2014 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix ceph_dir_llseek()

Comparing offset with inode->i_sb->s_maxbytes doesn't make sense for
directory. For a fragmented directory, offset (frag_t, off) can be
larger than inode->i_sb->s_maxbytes.

At the very beginning of ceph_dir_llseek(), local variable old_offset
is initialized to parameter offset. This doesn't make sense neither.
Old_offset should be ceph_make_fpos(fi->frag, fi->next_offset).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@linaro.org>


# c969d9bf 16-Feb-2014 Guangliang Zhao <lucienchao@gmail.com>

ceph: make ceph_forget_all_cached_acls() static inline

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Sage Weil <sage@inktank.com>


# 32d35d44 30-Jan-2014 Peter Rosin <peda@lysator.liu.se>

ceph: remove duplicate declaration of ceph_setattr

Signed-off-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Sage Weil <sage@inktank.com>


# 72466d0b 29-Jan-2014 Sage Weil <sage@inktank.com>

ceph: fix posix ACL hooks

The merge of commit 7221fe4c2ed7 ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957d0
"fs: add generic xattr_acl handlers" and others).

Some of the fallout was fixed in commit 4db658ea0ca ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4db658ea 28-Jan-2014 Linus Torvalds <torvalds@linux-foundation.org>

ceph: Fix up after semantic merge conflict

The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2ed7 ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957d0 "fs: add generic
xattr_acl handlers", 5bf3258fd2ac "fs: make posix_acl_chmod more useful"
and 37bc15392a23 "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict. And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way. And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too. But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this. Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11df2dfb 23-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: add imported caps when handling cap export message

Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 9215aeea 29-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: check inode caps in ceph_d_revalidate

Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 9563f88c 21-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix cache revoke race

handle following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 7221fe4c 11-Nov-2013 Guangliang Zhao <lucienchao@gmail.com>

ceph: add acl for cephfs

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>


# 9f12bd11 20-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: drop unconnected inodes

Positve dentry and corresponding inode are always accompanied in MDS reply.
So no need to keep inode in the cache after dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# a096b09a 21-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: queue cap release in __ceph_remove_cap()

call __queue_cap_release() in __ceph_remove_cap(), this avoids
acquiring s_cap_lock twice.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# ed284c49 02-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: remove ceph_lookup_inode()

commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
which provides similar function. So remove ceph_lookup_inode(), use
ceph_find_inode() instead.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
Reviewed-by: Sage Weil <sage@inktank.com>


# 99ccbd22 21-Aug-2013 Milosz Tanski <milosz@adfin.com>

ceph: use fscache as a local presisent cache

Adding support for fscache to the Ceph filesystem. This would bring it to on
par with some of the other network filesystems in Linux (like NFS, AFS, etc...)

In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>


# b0d7c223 12-Aug-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: introduce i_truncate_mutex

I encountered below deadlock when running fsstress

wmtruncate work truncate MDS
--------------- ------------------ --------------------------
lock i_mutex
<- truncate file
lock i_mutex (blocked)
<- revoking Fcb (filelock to MIX)
send request ->
handle request (xlock filelock)

At the initial time, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces inode size
and creates some 'out of i_size' dirty pages. wmtruncate work can't
truncate these dirty pages because it's blocked by the i_mutex. Later
when the kclient receives the cap message that revokes Fcb caps, It
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.

When the MDS handles the 'truncate' request from kclient, it waits
for the filelock to become stable. But the filelock is stuck in
unstable state because it can't finish revoking kclient's Fcb caps.

The truncate pagecache locking has already caused lots of trouble
for use. I think it's time simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from concurrent execution.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 6f60f889 23-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix freeing inode vs removing session caps race

remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.

We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# b415bf4f 01-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: fix pending vmtruncate race

The locking order for pending vmtruncate is wrong, it can lead to
following race:

write wmtruncate work
------------------------ ----------------------
lock i_mutex
check i_truncate_pending check i_truncate_pending
truncate_inode_pages() lock i_mutex (blocked)
copy data to page cache
unlock i_mutex
truncate_inode_pages()

The fix is take i_mutex before calling __ceph_do_pending_vmtruncate()

Fixes: http://tracker.ceph.com/issues/5453
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 93faca6e 25-Jun-2013 majianpeng <majianpeng@gmail.com>

ceph: Reconstruct the func ceph_reserve_caps.

Drop ignored return value. Fix allocation failure case to not leak.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# 39be95e9 15-May-2013 Jim Schutt <jaschut@sandia.gov>

ceph: ceph_pagelist_append might sleep while atomic

Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
while holding a lock, but it's spoiled because ceph_pagelist_addpage()
always calls kmap(), which might sleep. Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
. . .
[13439.376225] Call Trace:
[13439.378757] [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
[13439.384353] [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491] [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035] [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
[13439.403775] [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
[13439.409277] [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622] [<ffffffff81196748>] ? igrab+0x28/0x70
[13439.420610] [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584] [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499] [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646] [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363] [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250] [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
[13439.461534] [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432] [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
[13439.473934] [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464] [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
[13439.486492] [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
[13439.493190] [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
. . .
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the number
of encoded locks is stable, copy that into a ceph_pagelist.

[elder@inktank.com: abbreviated the stack info a bit.]

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>


# 2f276c51 13-Mar-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use i_release_count to indicate dir's completeness

Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.

This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>


# 3f99969f 28-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: acquire i_mutex in __ceph_do_pending_vmtruncate

make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call safely.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# a8673d61 18-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag

commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>


# d40ee0dc 17-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com>

ceph: queue cap release when trimming cap

So the client will later send cap release message to MDS

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 92a49fb0 22-Feb-2013 Sage Weil <sage@inktank.com>

ceph: fix statvfs fr_size

Different versions of glibc are broken in different ways, but the short of
it is that for the time being, frsize should == bsize, and be used as the
multiple for the blocks, free, and available fields. This mirrors what is
done for NFS. The previous reporting of the page size for frsize meant
that newer glibc and df would report a very small value for the fs size.

Fixes http://tracker.ceph.com/issues/3793.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>


# 9e0eb85d 06-Feb-2013 Alex Elder <elder@inktank.com>

ceph: remove a few bogus declarations

There are three ceph page vector functions declared in
"fs/ceph/super.h" that don't belong there. They're
probably left over from some long-ago code reorganization.

They're properly declared in "include/linux/ceph/libceph.h"
so just delete the ones in "super.h".

This and the next few commits resolve:
http://tracker.ceph.com/issues/4053

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 05cb11c1 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com>

ceph: Translate between uid and gids in cap messages and kuids and kgids

- Make the uid and gid arguments of send_cap_msg() used to compose
ceph_mds_caps messages of type kuid_t and kgid_t.

- Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
through variables of type kuid_t and kgid_t.

- Modify struct ceph_cap_snap to store uids and gids in types kuid_t
and kgid_t. This allows capturing inode->i_uid and inode->i_gid in
ceph_queue_cap_snap() without loss and pssing them to
__ceph_flush_snaps() where they are removed from struct
ceph_cap_snap and passed to send_cap_msg().

- In handle_cap_grant translate uid and gids in the initial user
namespace stored in struct ceph_mds_cap into kuids and kgids
before setting inode->i_uid and inode->i_gid.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


# 5ef50c3b 31-Jul-2012 Sage Weil <sage@inktank.com>

ceph: simplify+fix atomic_open

The initial ->atomic_open op was carried over from the old intent code,
which was incomplete and didn't really work. Replace it with a fresh
method. In particular:

* always attempt to do an atomic open+lookup, both for the create case
and for lookups of existing files.
* fix symlink handling by returning 1 to the VFS so that we can follow
the link to its destination. This fixes a longstanding ceph bug (#2392).

Signed-off-by: Sage Weil <sage@inktank.com>


# aa711ee3 13-Jul-2012 Alex Elder <elder@inktank.com>

ceph: define snap counts as u32 everywhere

There are two structures in which a count of snapshots are
maintained:

struct ceph_snap_context {
...
u32 num_snaps;
...
}
and
struct ceph_snap_realm {
...
u32 num_prior_parent_snaps; /* had prior to parent_since */
...
u32 num_snaps;
...
}

These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned. Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.

So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>


# 30d90494 21-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

kill struct opendata

Just pass struct file *. Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int. Next: saner prototypes for parts in
namei.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d9585277 21-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

make ->atomic_open() return int

Change of calling conventions:
old new
NULL 1
file 0
ERR_PTR(-ve) -ve

Caller *knows* that struct file *; no need to return it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 47237687 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

->atomic_open() prototype change - pass int * instead of bool *

... and let finish_open() report having opened the file via that sucker.
Next step: don't modify od->filp at all.

[AV: FILE_CREATE was already used by cifs; Miklos' fix folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2d83bde9 05-Jun-2012 Miklos Szeredi <mszeredi@suse.cz>

ceph: implement i_op->atomic_open()

Add an ->atomic_open implementation which replaces the atomic lookup+open+create
operation implemented via ->lookup and ->create operations.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3819219b 05-Jun-2012 Miklos Szeredi <mszeredi@suse.cz>

ceph: remove unused arg from ceph_lookup_open()

What was the purpose of this?

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3ce6cd12 23-Jan-2012 Alex Elder <elder@dreamhost.com>

ceph: avoid repeatedly computing the size of constant vxattr names

All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time. So there's no need to compute their length every
time any file's attribute is listed.

Record the length of each string and use it when needed to determine
the space need to represent them. In addition, compute the
aggregate size of strings in each table just once at initialization
time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# a661fc56 23-Jan-2012 Amon Ott <ao@m-privacy.de>

ceph: use 2 instead of 1 as fallback for 32-bit inode number

The root directory of the Ceph mount has inode number 1, so falling back
to 1 always creates a collision. 2 is unused on my test systems and seems
less likely to collide.

Signed-off-by: Amon Ott <ao@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>


# a40dc6cc 10-Jan-2012 Sage Weil <sage@newdream.net>

ceph: enable/disable dentry complete flags via mount option

Enable/disable use of the dentry dir 'complete' flag via a mount option.
This lets the admin control whether ceph uses the dcache to satisfy
negative lookups or readdir when it has the entire directory contents in
its cache.

This is purely a performance optimization; correctness is guaranteed
whether it is enabled or not.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>


# 5706b27d 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk>

ceph: propagate umode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be655596 30-Nov-2011 Sage Weil <sage@newdream.net>

ceph: use i_ceph_lock instead of i_lock

We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction. That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.

Changing the list lock ordering would be a painful process.

However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>


# c6ffe100 03-Nov-2011 Sage Weil <sage@newdream.net>

ceph: use new D_COMPLETE dentry flag

We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().

Signed-off-by: Sage Weil <sage@newdream.net>


# b58dc410 15-Mar-2011 Sage Weil <sage@newdream.net>

ceph: clear parent D_COMPLETE flag when on dentry prune

When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag
on the parent dentry. Do this for the live and snapshotted namespaces. Do
not bother for the .snap dir contents, since we do not cache that.

Signed-off-by: Sage Weil <sage@newdream.net>


# 3310f754 20-Oct-2011 Amon Ott <a.ott@m-privacy.de>

ceph: fix 32-bit ino numbers

Fix 32-bit ino generation to not always be 1.

Signed-off-by: Amon Ott <a.ott@m-privacy.de>


# 83817e35 04-Aug-2011 Sage Weil <sage@newdream.net>

ceph: rename rsize -> rasize

It controls readahead.

Signed-off-by: Sage Weil <sage@newdream.net>


# e5f86dc3 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug

Have caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash valud.

While we're here, simpify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.

Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 5f21c96d 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: protect access to d_parent

d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 468640e3 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: fix ceph_lookup_open intent usage

We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors. So:

- use separate helper for the hidden snapdir translation, immediately
following the mds request
- use ceph_finish_lookup for the final dentry/return value dance in the
exit path
- lookup_instantiate_filp on success

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 9cfa1098 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: use flag bit for at_end readdir flag

This saves us a word of memory per file.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 4918b6d1 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io

This allows us to force IO through the sync path which you normally only
get when multiple clients are reading/writing to the same file or by
mounting with -o sync. Among other things, this lets test programs verify
correctness with a single mount.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 252c6728 26-Jul-2011 Sage Weil <sage@newdream.net>

ceph: add flags field to file_info

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 02c24a82 16-Jul-2011 Josef Bacik <josef@redhat.com>

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 10556cb2 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->permission()

not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d3d0720d 11-May-2011 Henry C Chang <henry.cy.chang@gmail.com>

ceph: do not use i_wrbuffer_ref as refcount for Fb cap

We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# fca65b4a 04-May-2011 Sage Weil <sage@newdream.net>

ceph: do not call __mark_dirty_inode under i_lock

The __mark_dirty_inode helper now takes i_lock as of 250df6ed. Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>


# 80456f86 10-Mar-2011 Sage Weil <sage@newdream.net>

ceph: move readahead default to fs/ceph from libceph

Signed-off-by: Sage Weil <sage@newdream.net>


# ad1fee96 21-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: add ino32 mount option

The ino32 mount option forces the ceph fs to report 32 bit
ino values. This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>


# 9bde178d 28-Feb-2011 Sage Weil <sage@newdream.net>

Revert "ceph: keep reference to parent inode on ceph_dentry"

This reverts commit 97d79b403ef03f729883246208ef5d8a2ebc4d68.

This fails to account for d_parent changes due to rename or disconnected
dentries due to submounts or NFS reexports.

Signed-off-by: Sage Weil <sage@newdream.net>


# 97d79b40 18-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: keep reference to parent inode on ceph_dentry

When creating a new dentry we now hold a reference to the parent
inode in the ceph_dentry. This is required due to the new RCU
changes from 949854d0, which set dentry->d_parent to NULL in d_kill before
calling the ->release() callback. If/when that behavior is changed, we can
revert this hack.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 6c0f3af7 16-Nov-2010 Sage Weil <sage@newdream.net>

ceph: add dir_layout to inode

Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>


# b74c79e9 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# cd045cb4 04-Nov-2010 Sage Weil <sage@newdream.net>

ceph: fix rdcache_gen usage and invalidate

We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>


# efa4c120 18-Oct-2010 Sage Weil <sage@newdream.net>

ceph: do not carry i_lock for readdir from dcache

We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work. Since we don't actually need it here anyway, remove
it.

We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.

Signed-off-by: Sage Weil <sage@newdream.net>


# 3d14c5d2 06-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>


# e835124c 17-Sep-2010 Sage Weil <sage@newdream.net>

ceph: only send one flushsnap per cap_snap per mds session

Sending multiple flushsnap messages is problematic because we ignore
the response if the tid doesn't match, and the server may only respond to
each one once. It's also a waste.

So, skip cap_snaps that are already on the flushing list, unless the caller
tells us to resend (because we are reconnecting).

Signed-off-by: Sage Weil <sage@newdream.net>


# ae00d4f3 16-Sep-2010 Sage Weil <sage@newdream.net>

ceph: fix cap_snap and realm split

The cap_snap creation/queueing relies on both the current i_head_snapc
_and_ the i_snap_realm pointers being correct, so that the new cap_snap
can properly reference the old context and the new i_head_snapc can be
updated to reference the new snaprealm's context. To fix this, we:

- move inodes completely to the new (split) realm so that i_snap_realm
is correct, and
- generate the new snapc's _before_ queueing the cap_snaps in
ceph_update_snap_trace().

Signed-off-by: Sage Weil <sage@newdream.net>


# 7d8cb26d 24-Aug-2010 Sage Weil <sage@newdream.net>

ceph: maintain i_head_snapc when any caps are dirty, not just for data

We used to use i_head_snapc to keep track of which snapc the current epoch
of dirty data was dirtied under. It is used by queue_cap_snap to set up
the cap_snap. However, since we queue cap snaps for any dirty caps, not
just for dirty file data, we need to keep a valid i_head_snapc anytime
we have dirty|flushing caps. This fixes a NULL pointer deref in
queue_cap_snap when writing back dirty caps without data (e.g.,
snaptest-authwb.sh).

Signed-off-by: Sage Weil <sage@newdream.net>


# 4a625be4 22-Aug-2010 Sage Weil <sage@newdream.net>

ceph: include dirty xattrs state in snapped caps

When we snapshot dirty metadata that needs to be written back to the MDS,
include dirty xattr metadata. Make the capsnap reference the encoded
xattr blob so that it will be written back in the FLUSHSNAP op.

Also fix the capsnap creation guard to include dirty auth or file bits,
not just tests specific to dirty file data or file writes in progress
(this fixes auth metadata writeback).

Signed-off-by: Sage Weil <sage@newdream.net>


# 52dfb8ac 03-Aug-2010 Sage Weil <sage@newdream.net>

ceph: constify dentry_operations

This makes checkpatch happy.

Signed-off-by: Sage Weil <sage@newdream.net>


# 40819f6f 02-Aug-2010 Greg Farnum <gregf@hq.newdream.net>

ceph: add flock/fcntl lock support

Implement flock inode operation to support advisory file locking. All
lock/unlock operations are synchronous with the MDS. Lock state is
sent when reconnecting to a recovering MDS to restore the shared lock
state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# c6f3fdc5 02-Aug-2010 Greg Farnum <gregf@hq.newdream.net>

ceph: add CEPH_FEATURE_FLOCK to the supported feature bits

This informs the server that we will accept v2 client_caps format and v2
client_reconnect format messages.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# a8b763a9 08-Jul-2010 Sage Weil <sage@newdream.net>

ceph: use %pU to print uuid (fsid)

Signed-off-by: Sage Weil <sage@newdream.net>


# 6a259382 07-Jul-2010 Sage Weil <sage@newdream.net>

ceph: specify supported features in super.h

Specify the supported/required feature bits in super.h client code instead
of using the definitions from the shared kernel/userspace headers (which
will go away shortly).

Signed-off-by: Sage Weil <sage@newdream.net>


# 2bc50259 30-Jun-2010 Greg Farnum <gregf@hq.newdream.net>

ceph: add ceph_get_cap_for_mds function.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 37151668 17-Jun-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: do caps accounting per mds_client

Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 984c7690 23-May-2010 Andrea Gelmini <andrea.gelmini@gelma.net>

ceph: removed duplicated #includes

fs/ceph/auth.c: linux/slab.h is included more than once.
fs/ceph/super.h: linux/slab.h is included more than once.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 7ea80859 26-May-2010 Christoph Hellwig <hch@lst.de>

drop unused dentry argument to ->fsync

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 23804d91 14-May-2010 Sage Weil <sage@newdream.net>

ceph: specify max_bytes on readdir replies

Specify max bytes in request to bound size of reply. Add associated
mount option with default value of 512 KB.

Signed-off-by: Sage Weil <sage@newdream.net>


# 6e19a16e 29-Apr-2010 Sage Weil <sage@newdream.net>

ceph: clean up mount options, ->show_options()

Ensure all options are included in /proc/mounts. Some cleanup.

Signed-off-by: Sage Weil <sage@newdream.net>


# 640ef79d 26-Mar-2010 Cheng Renquan <crquan@gmail.com>

ceph: use ceph_sb_to_client instead of ceph_client

ceph_sb_to_client and ceph_client are really identical, we need to dump
one; while function ceph_client is confusing with "struct ceph_client",
ceph_sb_to_client's definition is more clear; so we'd better switch all
call to ceph_sb_to_client.

-static inline struct ceph_client *ceph_client(struct super_block *sb)
-{
- return sb->s_fs_info;
-}

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 81a6cf2d 14-May-2010 Sage Weil <sage@newdream.net>

ceph: invalidate affected dentry leases on aborted requests

If we abort a request, we return to caller, but the request may still
complete. And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request. A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').

Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so
we don't have valid (but incorrect) leases. Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.

Signed-off-by: Sage Weil <sage@newdream.net>


# d45d0d97 20-Apr-2010 Sage Weil <sage@newdream.net>

ceph: add missing #includes

Signed-off-by: Sage Weil <sage@newdream.net>


# fc837c8f 13-Apr-2010 Sage Weil <sage@newdream.net>

ceph: queue_cap_snap should always queue dirty context

This simplifies the calling convention, and fixes a bug where we queue a
capsnap with a context other than i_head_snapc (the one that matches the
dirty pages). The result was a BUG at fs/ceph/caps.c:2178 on writeback
completion when a capsnap matching the writeback snapc could not be found.

Signed-off-by: Sage Weil <sage@newdream.net>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# f1a3d572 17-Jan-2010 Stephen Rothwell <sfr@canb.auug.org.au>

ceph: update for write_inode API change

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sage Weil <sage@newdream.net>


# 422d2cb8 26-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: reset osd after relevant messages timed out

This simplifies the process of timing out messages. We
keep lru of current messages that are in flight. If a
timeout has passed, we reset the osd connection, so that
messages will be retransmitted. This is a failsafe in case
we hit some sort of problem sending out message to the OSD.
Normally, we'll get notification via an updated osdmap if
there are problems.

If a request is older than the keepalive timeout, send a
keepalive to ensure we detect any breaks in the TCP connection.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# e9964c10 01-Mar-2010 Sage Weil <sage@newdream.net>

ceph: fix flush_dirty_caps race with caps migration

The flush_dirty_caps() used to loop over the first entry of the cap_dirty
dirty list on the assumption that after calling ceph_check_caps() it would
be removed from the list. This isn't true for caps that are being
migrated between MDSs, where we've received the EXPORT but not the IMPORT.

Instead, do a safe list iteration, and pin the next inode on the list via
the CEPH_I_NOFLUSH flag.

Signed-off-by: Sage Weil <sage@newdream.net>


# 2600d2dd 22-Feb-2010 Sage Weil <sage@newdream.net>

ceph: drop messages on unregistered mds sessions; cleanup

Verify the mds session is currently registered before handling
incoming messages. Clean up message handlers to pull mds out
of session->s_mds instead of less trustworthy src field.

Clean up con_{get,put} debug output.

Signed-off-by: Sage Weil <sage@newdream.net>


# 7c1332b8 16-Feb-2010 Sage Weil <sage@newdream.net>

ceph: fix iterate_caps removal race

We need to be able to iterate over all caps on a session with a
possibly slow callback on each cap. To allow this, we used to
prevent cap reordering while we were iterating. However, we were
not safe from races with removal: removing the 'next' cap would
make the next pointer from list_for_each_entry_safe be invalid,
and cause a lock up or similar badness.

Instead, we keep an iterator pointer in the session pointing to
the current cap. As before, we avoid reordering. For removal,
if the cap isn't the current cap we are iterating over, we are
fine. If it is, we clear cap->ci (to mark the cap as pending
removal) but leave it in the session list. In iterate_caps, we
can safely finish removal and get the next cap pointer.

While we're at it, clean up put_cap to not take a cap reservation
context, as it was never used.

Signed-off-by: Sage Weil <sage@newdream.net>


# 85ccce43 17-Feb-2010 Sage Weil <sage@newdream.net>

ceph: clean up readdir caps reservation

Use a global counter for the minimum number of allocated caps instead of
hard coding a check against readdir_max. This takes into account multiple
client instances, and avoids examining the superblock mount options when a
cap is dropped.

Signed-off-by: Sage Weil <sage@newdream.net>


# a105f00c 15-Feb-2010 Sage Weil <sage@newdream.net>

ceph: use rbtree for snap_realms

Switch from radix tree to rbtree for snap realms. This is much more
appropriate given that realm keys are few and far between.

Signed-off-by: Sage Weil <sage@newdream.net>


# 3c6f6b79 09-Feb-2010 Sage Weil <sage@newdream.net>

ceph: cleanup async writeback, truncation, invalidate helpers

Grab inode ref in helper. Make work functions static, with consistent
naming.

Signed-off-by: Sage Weil <sage@newdream.net>


# f5a2041b 03-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: put unused osd connections on lru

Instead of removing osd connection immediately when the
requests list is empty, put the osd connection on an lru.
Only if that osd has not been used for more than a specified
time, will it be removed.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 9bd2e6f8 02-Feb-2010 Sage Weil <sage@newdream.net>

ceph: allow renewal of auth credentials

Add infrastructure to allow the mon_client to periodically renew its auth
credentials. Also add a messenger callback that will force such a renewal
if a peer rejects our authenticator.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 2baba250 18-Dec-2009 Yehuda Sadeh <yehuda@hq.newdream.net>

ceph: writeback congestion control

Set bdi congestion bit when amount of write data in flight exceeds adjustable
threshold.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>


# 06edf046 15-Dec-2009 Sage Weil <sage@newdream.net>

ceph: include link to bdi in debugfs

Signed-off-by: Sage Weil <sage@newdream.net>


# 0743304d 18-Nov-2009 Sage Weil <sage@newdream.net>

ceph: fix debugfs entry, simplify fsid checks

We may first learn our fsid from any of the mon, osd, or mds maps
(whichever the monitor sends first). Consolidate checks in a single
helper. Initialize the client debugfs entry then, since we need the
fsid (and global_id) for the directory name.

Also remove dead mount code.

Signed-off-by: Sage Weil <sage@newdream.net>


# 4e7a5dcd 18-Nov-2009 Sage Weil <sage@newdream.net>

ceph: negotiate authentication protocol; implement AUTH_NONE protocol

When we open a monitor session, we send an initial AUTH message listing
the auth protocols we support, our entity name, and (possibly) a previously
assigned global_id. The monitor chooses a protocol and responds with an
initial message.

Initially implement AUTH_NONE, a dummy protocol that provides no security,
but works within the new framework. It generates 'authorizers' that are
used when connecting to (mds, osd) services that simply state our entity
name and global_id.

This is a wire protocol change.

Signed-off-by: Sage Weil <sage@newdream.net>


# 039934b8 12-Nov-2009 Sage Weil <sage@newdream.net>

ceph: build cleanly without CONFIG_DEBUG_FS

Signed-off-by: Sage Weil <sage@newdream.net>


# cdac8303 10-Nov-2009 Sage Weil <sage@newdream.net>

ceph: remove recon_gen logic

We don't get an explicit affirmative confirmation that our caps reconnect,
nor do we necessarily want to pay that cost. So, take all this code out
for now.

Signed-off-by: Sage Weil <sage@newdream.net>


# 685f9a5d 09-Nov-2009 Sage Weil <sage@newdream.net>

ceph: do not confuse stale and dead (unreconnected) caps

We were using the cap_gen to track both stale caps (caps that timed out
due to temporarily losing touch with the mds) and dead caps that did not
reconnect after an MDS failure. Introduce a recon_gen counter to track
reconnections to restarted MDSs and kill dead caps based on that instead.

Rename gen to cap_gen while we're at it to make it more clear which is
which.

Signed-off-by: Sage Weil <sage@newdream.net>


# fbbccec9 28-Oct-2009 Noah Watkins <noah@noahdesu.com>

ceph: replace list_entry with container_of

Usage of non-list.h list_entry function for container_of
functionality replaced with direct use of container_of.

Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>


# 6b805185 27-Oct-2009 Sage Weil <sage@newdream.net>

ceph: allocate and parse mount args before client instance

This simplifies much of the error handling during mount. It also means
that we have the mount args before client creation, and we can initialize
based on those options.

Signed-off-by: Sage Weil <sage@newdream.net>


# ecb19c46 22-Oct-2009 Sage Weil <sage@newdream.net>

ceph: remove small mon addr limit; use CEPH_MAX_MON where appropriate

Get rid of separate max mon limit; use the system limit instead. This
allows mounts when there are lots of mon addrs provided by mount.ceph (as
with a host with lots of A/AAAA records).

Signed-off-by: Sage Weil <sage@newdream.net>


# 8fa97655 16-Oct-2009 Sage Weil <sage@newdream.net>

ceph: enable readahead

Initialized bdi->ra_pages to enable readahead. Use 512KB default.

Signed-off-by: Sage Weil <sage@newdream.net>


# afcdaea3 14-Oct-2009 Sage Weil <sage@newdream.net>

ceph: flush dirty caps via the cap_dirty list

Previously we were flushing dirty caps by passing an extra flag
when traversing the delayed caps list. Besides being a bit ugly,
that can also miss caps that are dirty but didn't result in a
cap requeue: notably, mark_caps_dirty().

Separate the flushing into a separate helper, and traverse the
cap_dirty list.

This also brings i_dirty_item in line with i_dirty_caps: we are
on the list IFF caps != 0. We carry an inode ref IFF
dirty_caps|flushing_caps != 0.

Lose the unused return value from __ceph_mark_caps_dirty().

Signed-off-by: Sage Weil <sage@newdream.net>


# de57606c 06-Oct-2009 Sage Weil <sage@newdream.net>

ceph: client types

We first define constants, types, and prototypes for the kernel client
proper.

A few subsystems are defined separately later: the MDS, OSD, and
monitor clients, and the messaging layer.

Signed-off-by: Sage Weil <sage@newdream.net>