History log of /linux-master/include/linux/dcache.h
Revision Date Author Comments
# 11b3f8ae 02-Feb-2024 Amir Goldstein <amir73il@gmail.com>

fs: remove the inode argument to ->d_real() method

The only remaining user of ->d_real() method is d_real_inode(), which
passed NULL inode argument to get the real data dentry.

There are no longer any users that call ->d_real() with a non-NULL
inode argument for getting a detry from a specific underlying layer.

Remove the inode argument of the method and replace it with an integer
'type' argument, to allow callers to request the real metadata dentry
instead of the real data dentry.

All the current users of d_real_inode() (e.g. uprobe) continue to get
the real data inode. Caller that need to get the real metadata inode
(e.g. IMA/EVM) can use d_inode(d_real(dentry, D_REAL_METADATA)).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202110132.1584111-3-amir73il@gmail.com
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 7e4a205f 03-Feb-2024 Al Viro <viro@zeniv.linux.org.uk>

Revert "get rid of DCACHE_GENOCIDE"

This reverts commit 57851607326a2beef21e67f83f4f53a90df8445a.

Unfortunately, while we only call that thing once, the callback
*can* be called more than once for the same dentry - all it
takes is rename_lock being touched while we are in d_walk().
For now let's revert it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1b6ae9f6 06-Nov-2023 Vegard Nossum <vegard.nossum@oracle.com>

dcache: remove unnecessary NULL check in dget_dlock()

dget_dlock() requires dentry->d_lock to be held when called, yet
contains a NULL check for dentry.

An audit of all calls to dget_dlock() shows that it is never called
with a NULL pointer (as spin_lock()/spin_unlock() would crash in these
cases):

$ git grep -W '\<dget_dlock\>'

arch/powerpc/platforms/cell/spufs/inode.c- spin_lock(&dentry->d_lock);
arch/powerpc/platforms/cell/spufs/inode.c- if (simple_positive(dentry)) {
arch/powerpc/platforms/cell/spufs/inode.c: dget_dlock(dentry);

fs/autofs/expire.c- spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
fs/autofs/expire.c- if (simple_positive(child)) {
fs/autofs/expire.c: dget_dlock(child);

fs/autofs/root.c: dget_dlock(active);
fs/autofs/root.c- spin_unlock(&active->d_lock);

fs/autofs/root.c: dget_dlock(expiring);
fs/autofs/root.c- spin_unlock(&expiring->d_lock);

fs/ceph/dir.c- if (!spin_trylock(&dentry->d_lock))
fs/ceph/dir.c- continue;
[...]
fs/ceph/dir.c: dget_dlock(dentry);

fs/ceph/mds_client.c- spin_lock(&alias->d_lock);
[...]
fs/ceph/mds_client.c: dn = dget_dlock(alias);

fs/configfs/inode.c- spin_lock(&dentry->d_lock);
fs/configfs/inode.c- if (simple_positive(dentry)) {
fs/configfs/inode.c: dget_dlock(dentry);

fs/libfs.c: found = dget_dlock(d);
fs/libfs.c- spin_unlock(&d->d_lock);

fs/libfs.c: found = dget_dlock(child);
fs/libfs.c- spin_unlock(&child->d_lock);

fs/libfs.c: child = dget_dlock(d);
fs/libfs.c- spin_unlock(&d->d_lock);

fs/ocfs2/dcache.c: dget_dlock(dentry);
fs/ocfs2/dcache.c- spin_unlock(&dentry->d_lock);

include/linux/dcache.h:static inline struct dentry *dget_dlock(struct dentry *dentry)

After taking out the NULL check, dget_dlock() becomes almost identical
to __dget_dlock(); the only difference is that dget_dlock() returns the
dentry that was passed in. These are static inline helpers, so we can
rely on the compiler to discard unused return values. We can therefore
also remove __dget_dlock() and replace calls to it by dget_dlock().

Also fix up and improve the kerneldoc comments while we're at it.

Al Viro pointed out that we can also clean up some of the callers to
make use of the returned value and provided a bit more info for the
kerneldoc.

While preparing v2 I also noticed that the tabs used in the kerneldoc
comments were causing the kerneldoc to get parsed incorrectly so I also
fixed this up (including for d_unhashed, which is otherwise unrelated).

Testing: x86 defconfig build + boot; make htmldocs for the kerneldoc
warning. objdump shows there are code generation changes.

Link: https://lore.kernel.org/all/20231022164520.915013-1-vegard.nossum@oracle.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1b327b5a 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

kill DCACHE_MAY_FREE

With the new ordering in __dentry_kill() it has become redundant -
it's set if and only if both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST
are set.

We set it in __dentry_kill(), after having set DCACHE_DENTRY_KILLED
with the only condition being that DCACHE_SHRINK_LIST is there;
all of that is done without dropping ->d_lock and the only place
that checks that flag (shrink_dentry_list()) does so under ->d_lock,
after having found the victim on its shrink list. Since DCACHE_SHRINK_LIST
is set only when placing dentry into shrink list and removed only by
shrink_dentry_list() itself, a check for DCACHE_DENTRY_KILLED in
there would be equivalent to check for DCACHE_MAY_FREE.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 57851607 12-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

get rid of DCACHE_GENOCIDE

... now that we never call d_genocide() other than from kill_litter_super()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8a54b38f 11-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

d_genocide(): move the extern into fs/internal.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f2824db1 18-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller

now that the only user of d_instantiate_anon() is gone...
[braino fix folded - kudos to Dan Carpenter]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# da549bdd 07-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

dentry: switch the lists of children to hlist

Saves a pointer per struct dentry and actually makes the things less
clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.

A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 698f1e2b 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

kill d_backing_dentry()

no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2fcd38f4 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

[software coproarchaeology] dentry.h: kill a mysterious comment

there's a strange comment in front of d_lookup() declaration:

/* appendix may either be NULL or be used for transname suffixes */

Looks like nobody had been curious enough to track its history;
it predates git, it predates bitkeeper and if you look through
the pre-BK trees, you finally arrive at this in 2.1.44-for-davem:
/* appendix may either be NULL or be used for transname suffixes */
-extern struct dentry * d_lookup(struct inode * dir, struct qstr * name,
- struct qstr * appendix);
+extern struct dentry * d_lookup(struct dentry * dir, struct qstr * name);
In other words, it refers to the third argument d_lookup() used to have
back then. It had been introduced in 2.1.43-pre, on June 12 1997,
along with d_lookup(), only to be removed by July 4 1997, presumably
when the Cthulhu-awful thing it used to be used for (look for
CONFIG_TRANS_NAMES in 2.1.43-pre, and keep a heavy-duty barfbag
ready) had been, er, noticed and recognized for what it had been.

Despite the appendectomy, the comment remained. Some things really
need to be put out of their misery...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0d486510 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

dentry.h: trim externs

d_instantiate_unique() had been gone for 7 years; __d_lookup...()
and shrink_dcache_for_umount() are fs/internal.h fodder.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8219cb58 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

kill d_{is,set}_fallthru()

Introduced in 2015 and never had any in-tree users...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0bec65a8 09-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

DCACHE_COOKIE: RIP

the last user gone in 2021...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 641c3ef5 09-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

DCACHE_... ->d_flags bits: switch to BIT()

For bits 20..22 (inode type cached in ->d_flags) turn the definitions into
expressions like (5 << 20); everything else turns into straight use of
BIT()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# acfde6e8 03-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

struct dentry: get rid of randomize_layout idiocy

This is beyond ridiculous. There is a reason why that thing is
cacheline-aligned...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 771eb4fe 09-Jul-2018 Kent Overstreet <kent.overstreet@gmail.com>

fs: factor out d_mark_tmpfile()

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
its own atomically with other btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>


# 863f144f 23-Sep-2022 Miklos Szeredi <mszeredi@redhat.com>

vfs: open inside ->tmpfile()

This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0f60d288 30-Jan-2022 Al Viro <viro@zeniv.linux.org.uk>

dynamic_dname(): drop unused dentry argument

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4f48d5da 15-May-2022 Xiubo Li <xiubli@redhat.com>

fs/dcache: export d_same_name() helper

Compare dentry name with case-exact name, return true if names
are same, or false.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>


# 45f78b0a 27-Jul-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

fs/dcache: Move the wakeup from __d_lookup_done() to the caller.

__d_lookup_done() wakes waiters on dentry->d_wait. On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquired wait_queue_head::lock, which is a "sleeping" spinlock on RT.

Calling it under dentry->d_lock is not a problem, since that is also a
"sleeping" spinlock on the same configs. Unfortunately, two of its
callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.

The key observation is that wakeup can be moved to any point before
dropping ->d_lock.

As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.

This is safe because:

Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags. As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).

->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry. Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.

With two exceptions wq lives in call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.

One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata. It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.

Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry. Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.

Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock. That concludes the analysis.

Let __d_lookup_unhash():

1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
2) Unhash the dentry
3) Retrieve and clear dentry::d_wait
4) Unlock the hash and return the retrieved waitqueue head pointer
5) Let the caller handle the wake up.
6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
build failures for OOT code that used __d_lookup_done() and is not
aware of the new return value.

This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.

An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside of Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c8c0c239 21-Jan-2022 Luis Chamberlain <mcgrof@kernel.org>

fs: move dcache sysctls to its own file

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the dcache sysctl clutter out of kernel/sysctl.c. This is a
small one-off entry, perhaps later we can simplify this representation,
but for now we use the helpers we have. We won't know how we can
simplify this further untl we're fully done with the cleanup.

[arnd@arndb.de: avoid unused-function warning]
Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80e5d1ff5 15-Apr-2021 Al Viro <viro@zeniv.linux.org.uk>

useful constants: struct qstr for ".."

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a2bbe664 07-Jul-2019 Al Viro <viro@zeniv.linux.org.uk>

constify dentry argument of dentry_path()/dentry_path_raw()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# bca585d2 05-Jan-2021 Al Viro <viro@zeniv.linux.org.uk>

new helper: d_find_alias_rcu()

similar to d_find_alias(inode), except that
* the caller must be holding rcu_read_lock()
* inode must not be freed until matching rcu_read_unlock()
* result is *NOT* pinned and can only be dereferenced until
the matching rcu_read_unlock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# aa6159ab 15-Dec-2020 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

kernel.h: split out mathematical helpers

kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out
mathematical helpers.

At the same time convert users in header and lib folder to use new
header. Though for time being include new header back to kernel.h to
avoid twisted indirected includes for existing users.

[sfr@canb.auug.org.au: fix powerpc build]
Link: https://lkml.kernel.org/r/20201029150809.13059608@canb.auug.org.au

Link: https://lkml.kernel.org/r/20201028173212.41768-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 501e43fb 23-Sep-2020 Eric Biggers <ebiggers@google.com>

fscrypt: rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAME

Originally we used the term "encrypted name" or "ciphertext name" to
mean the encoded filename that is shown when an encrypted directory is
listed without its key. But these terms are ambiguous since they also
mean the filename stored on-disk. "Encrypted name" is especially
ambiguous since it could also be understood to mean "this filename is
encrypted on-disk", similar to "encrypted file".

So we've started calling these encoded names "no-key names" instead.

Therefore, rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAME to avoid
confusion about what this flag means.

Link: https://lore.kernel.org/r/20200924042624.98439-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>


# 26475371 20-Jul-2020 Ahmed S. Darwish <a.darwish@linutronix.de>

vfs: Use sequence counter with associated spinlock

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-19-a.darwish@linutronix.de


# 2c567af4 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>

fs: Introduce DCACHE_DONTCACHE

DCACHE_DONTCACHE indicates a dentry should not be cached on final
dput().

Also add a helper function to mark DCACHE_DONTCACHE on all dentries
pointing to a specific inode when that inode is being set I_DONTCACHE.

This facilitates dropping dentry references to inodes sooner which
require eviction to swap S_DAX mode.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# d41efb52 04-Nov-2019 Al Viro <viro@zeniv.linux.org.uk>

fs/namei.c: pull positivity check into follow_managed()

There are 4 callers; two proceed to check if result is positive and
fail with ENOENT if it isn't; one (in handle_lookup_down()) is
guaranteed to yield positive and one (in lookup_fast()) is _preceded_
by positivity check.

However, follow_managed() on a negative dentry is a (fairly cheap)
no-op on anything other than autofs. And negative autofs dentries
are never hashed, so lookup_fast() is not going to run into one
of those. Moreover, successful follow_managed() on a _positive_
dentry never yields a negative one (and we significantly rely upon
that in callers of lookup_fast()).

In other words, we can easily transpose the positivity check and
the call of follow_managed() in lookup_fast(). And that allows
to fold the positivity check *into* follow_managed(), simplifying
life for the code downstream of its calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ec23eb54 26-Jul-2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

docs: fs: convert docs without extension to ReST

There are 3 remaining files without an extension inside the fs docs
dir.

Manually convert them to ReST.

In the case of the nfs/exporting.rst file, as the nfs docs
aren't ported yet, I opted to convert and add a :orphan: there,
with should be removed when it gets added into a nfs-specific
part of the fs documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# 5c437fa2 07-Jun-2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

docs: fs: fix broken links to vfs.txt with was renamed to vfs.rst

A recent documentation conversion renamed this file but forgot
to update the links.

Fixes: af96c1e304f7 ("docs: filesystems: vfs: Convert vfs.txt to RST")
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>


# 7e5f7bb0 20-May-2019 Al Viro <viro@zeniv.linux.org.uk>

unexport simple_dname()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 230c6402 26-Apr-2019 Al Viro <viro@zeniv.linux.org.uk>

ovl_lookup_real_one(): don't bother with strlen()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6cc24868 20-Mar-2019 Eric Biggers <ebiggers@google.com>

fscrypt: clean up and improve dentry revalidation

Make various improvements to fscrypt dentry revalidation:

- Don't try to handle the case where the per-directory key is removed,
as this can't happen without the inode (and dentries) being evicted.

- Flag ciphertext dentries rather than plaintext dentries, since it's
ciphertext dentries that need the special handling.

- Avoid doing unnecessary work for non-ciphertext dentries.

- When revalidating ciphertext dentries, try to set up the directory's
i_crypt_info to make sure the key is really still absent, rather than
invalidating all negative dentries as the previous code did. An old
comment suggested we can't do this for locking reasons, but AFAICT
this comment was outdated and it actually works fine.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>


# ab1152dd 15-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

unexport d_alloc_pseudo()

No modular uses since introducion of alloc_file_pseudo(),
and the only non-modular user not in alloc_file_pseudo()
had actually been wrong - should've been d_alloc_anon().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5467a68c 15-Mar-2019 Al Viro <viro@zeniv.linux.org.uk>

dcache: sort the freeing-without-RCU-delay mess for good.

For lockless accesses to dentries we don't have pinned we rely
(among other things) upon having an RCU delay between dropping
the last reference and actually freeing the memory.

On the other hand, for things like pipes and sockets we neither
do that kind of lockless access, nor want to deal with the
overhead of an RCU delay every time a socket gets closed.

So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
made sure it would happen. We tried to avoid setting it unless
we knew we need it. Unfortunately, that had led to recurring
class of bugs, in which we missed the need to set it.

We only really need it for dentries that are created by
d_alloc_pseudo(), so let's not bother with trying to be smart -
just make having an RCU delay the default. The ones that do
*not* get it set the replacement flag (DCACHE_NORCU) and we'd
better use that sparingly. d_alloc_pseudo() is the only
such user right now.

FWIW, the race that finally prompted that switch had been
between __lock_parent() of immediate subdirectory of what's
currently the root of a disconnected tree (e.g. from
open-by-handle in progress) racing with d_splice_alias()
elsewhere picking another alias for the same inode, either
on outright corrupted fs image, or (in case of open-by-handle
on NFS) that subdirectory having been just moved on server.
It's not easy to hit, so the sky is not falling, but that's
not the first race on similar missed cases and the logics
for settinf DCACHE_RCUACCESS has gotten ridiculously
convoluted.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# af0c9af1 30-Jan-2019 Waiman Long <longman@redhat.com>

fs/dcache: Track & report number of negative dentries

The current dentry number tracking code doesn't distinguish between
positive & negative dentries. It just reports the total number of
dentries in the LRU lists.

As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.

This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file. The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.

The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.

Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension. They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero. IOW, it will be safe to use one of the dummy array entry for
negative dentry count.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c971e6a0 28-May-2018 Al Viro <viro@zeniv.linux.org.uk>

kill d_instantiate_no_diralias()

The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success. ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fb16043b 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

vfs: remove open_flags from d_real()

Opening regular files on overlayfs is now handled via ovl_open(). Remove
the now unused "open_flags" argument from d_op->d_real() and the d_real()
helper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 4ab30319 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

Revert "vfs: add flags to d_real()"

This reverts commit 495e642939114478a5237a7d91661ba93b76f15a.

No user of "flags" argument of d_real() remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 88059de1 18-Jul-2018 Miklos Szeredi <mszeredi@redhat.com>

Revert "ovl: fix relatime for directories"

This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260.

Overlayfs no longer relies on the vfs correct atime handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 63a67a92 23-Jun-2018 Al Viro <viro@zeniv.linux.org.uk>

kill dentry_update_name_case()

the last user is gone

Spotted-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1e2e547a 04-May-2018 Al Viro <viro@zeniv.linux.org.uk>

do d_instantiate/unlock_new_inode combinations safely

For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 903ddaf4 06-Mar-2018 Al Viro <viro@zeniv.linux.org.uk>

take out orphan externs (empty_string/slash_string)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f9c34674 19-Jan-2018 Miklos Szeredi <miklos@szeredi.hu>

vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()

Those helpers are going to be used by overlayfs to implement
NFS export decode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 66702eb5 23-Oct-2017 Mark Rutland <mark.rutland@arm.com>

locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the dcache code and comments to use
{READ,WRITE}_ONCE() consistently.

----
virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4ded097b 19-Oct-2017 NeilBrown <neilb@suse.com>

constify more dcache.h inlined helpers.

const struct pointers in commit f0d3b3ded999 ("constify dcache.c
inlined helpers where possible").
This patch allows 'const' in a couple that were added since then.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cd91304e 04-Sep-2017 Miklos Szeredi <mszeredi@redhat.com>

ovl: fix relatime for directories

Need to treat non-regular overlayfs files the same as regular files when
checking for an atime update.

Add a d_real() flag to make it return the upper dentry for all file types.

Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 495e6429 04-Sep-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: add flags to d_real()

Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 0a2c13d9 12-Jul-2017 Stephen Rothwell <sfr@canb.auug.org.au>

include/linux/dcache.h: use unsigned chars in struct name_snapshot

"kernel.h: handle pointers to arrays better in container_of()" triggers:

In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/linux/syscalls.h:71,
from fs/dcache.c:17:
fs/dcache.c: In function 'release_dentry_name_snapshot':
include/linux/compiler.h:542:38: error: call to '__compiletime_assert_305' declared with attribute error: pointer type mismatch in container_of()
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/compiler.h:525:4: note: in definition of macro '__compiletime_assert'
prefix ## suffix(); \
^
include/linux/compiler.h:542:2: note: in expansion of macro '_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/build_bug.h:46:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^
include/linux/kernel.h:860:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
^
fs/dcache.c:305:7: note: in expansion of macro 'container_of'
p = container_of(name->name, struct external_name, name[0]);

Switch name_snapshot to use unsigned chars, matching struct qstr and
struct external_name.

Link: http://lkml.kernel.org/r/20170710152134.0f78c1e6@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 49d31c2f 07-Jul-2017 Al Viro <viro@zeniv.linux.org.uk>

dentry name snapshots

take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified). In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
struct name_snapshot s;

take_dentry_name_snapshot(&s, dentry);
...
access s.name
...
release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cdf01226 04-Jul-2017 David Howells <dhowells@redhat.com>

VFS: Provide empty name qstr

Provide an empty name (ie. "") qstr for general use.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3859a271 28-Oct-2016 Kees Cook <keescook@chromium.org>

randstruct: Mark various structs for randomization

This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>


# f9411ebe 06-Feb-2017 Ingo Molnar <mingo@kernel.org>

rcu: Separate the RCU synchronization types and APIs into <linux/rcupdate_wait.h>

So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.

Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.

Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.

( For rcutiny this means that two dependent APIs have to be uninlined,
but that shouldn't be much of a problem as they are rare variants. )

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 03440c4e 27-Feb-2017 Masahiro Yamada <yamada.masahiro@socionext.com>

scripts/spelling.txt: add "an union" pattern and fix typo instances

Fix typos and add the following to the scripts/spelling.txt:

an union||a union

Link: http://lkml.kernel.org/r/1481573103-11329-5-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f74e7b33 23-Nov-2016 Ian Kent <ikent@redhat.com>

vfs: remove unused have_submounts() function

Now that path_has_submounts() has been added have_submounts() is no
longer used so remove it.

Link: http://lkml.kernel.org/r/20161011053428.27645.12310.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 01619491 23-Nov-2016 Ian Kent <ikent@redhat.com>

vfs: add path_has_submounts()

d_mountpoint() can only be used reliably to establish if a dentry is
not mounted in any namespace. It isn't aware of the possibility there
may be multiple mounts using the given dentry, possibly in a different
namespace.

Add function, path_has_submounts(), that checks is a struct path contains
mounts (or is a mountpoint itself) to handle this case.

Link: http://lkml.kernel.org/r/20161011053403.27645.55242.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fb5f51c7 23-Nov-2016 Ian Kent <ikent@redhat.com>

vfs: change d_manage() to take a struct path

For the autofs module to be able to reliably check if a dentry is a
mountpoint in a multiple namespace environment the ->d_manage() dentry
operation will need to take a path argument instead of a dentry.

Link: http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7b1742eb 15-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: make argument of d_real_inode() const

d_op->d_real() leaves the dentry alone except if the third argument is
non-zero. Unfortunately very difficult to explain to the compiler without
a cast.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>


# 6fa67e70 31-Jul-2016 Al Viro <viro@zeniv.linux.org.uk>

get rid of 'parent' argument of ->d_compare()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 285b102d 28-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: new d_init method

Allow filesystem to initialize dentry at allocation time.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9aba36de 20-Jul-2016 Al Viro <viro@zeniv.linux.org.uk>

qstr constify instances in fs/dcache.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e698b8a4 30-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: document ->d_real()

Add missing documentation for the d_op->d_real() method and d_real()
helper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 2d902671 30-Jun-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: merge .d_select_inode() into .d_real()

The two methods essentially do the same: find the real dentry/inode
belonging to an overlay dentry. The difference is in the usage:

vfs_open() uses ->d_select_inode() and expects the function to perform
copy-up if necessary based on the open flags argument.

file_dentry() uses ->d_real() passing in the overlay dentry as well as the
underlying inode.

vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
with a zero inode would have worked just as well here.

This patch merges the functionality of ->d_select_inode() into ->d_real()
by adding an 'open_flags' argument to the latter.

[Al Viro] Make the signature of d_real() match that of ->d_real() again.
And constify the inode argument, while we are at it.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# ba65dc5e 10-Jun-2016 Al Viro <viro@zeniv.linux.org.uk>

much milder d_walk() race

d_walk() relies upon the tree not getting rearranged under it without
rename_lock being touched. And we do grab rename_lock around the
places that change the tree topology. Unfortunately, branch reordering
is just as bad from d_walk() POV and we have two places that do it
without touching rename_lock - one in handling of cursors (for ramfs-style
directories) and another in autofs. autofs one is a separate story; this
commit deals with the cursors.
* mark cursor dentries explicitly at allocation time
* make __dentry_kill() leave ->d_child.next pointing to the next
non-cursor sibling, making sure that it won't be moved around unnoticed
before the parent is relocked on ascend-to-parent path in d_walk().
* make d_walk() skip cursors explicitly; strictly speaking it's
not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
but it makes analysis easier.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f4bcbe79 20-May-2016 George Spelvin <linux@sciencehorizons.net>

Pull out string hash to <linux/stringhash.h>

... so they can be used without the rest of <linux/dcache.h>

The hashlen_* macros will make sense next patch.

Signed-off-by: George Spelvin <linux@sciencehorizons.net>


# a1180844 20-May-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: add d_real_inode() helper

Needed by the following fix.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>


# 54d5ca87 10-May-2016 Miklos Szeredi <mszeredi@redhat.com>

vfs: add vfs_select_inode() helper

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+


# d9171b93 15-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

parallel lookups machinery, part 4 (and last)

If we *do* run into an in-lookup match, we need to wait for it to
cease being in-lookup. Fortunately, we do have unused space in
in-lookup dentries - d_lru is never looked at until it stops being
in-lookup.

So we can stash a pointer to wait_queue_head from stack frame of
the caller of ->lookup(). Some precautions are needed while
waiting, but it's not that hard - we do hold a reference to dentry
we are waiting for, so it can't go away. If it's found to be
in-lookup the wait_queue_head is still alive and will remain so
at least while ->d_lock is held. Moreover, the condition we
are waiting for becomes true at the same point where everything
on that wq gets woken up, so we can just add ourselves to the
queue once.

d_alloc_parallel() gets a pointer to wait_queue_head_t from its
caller; lookup_slow() adjusted, d_add_ci() taught to use
d_alloc_parallel() if the dentry passed to it happens to be
in-lookup one (i.e. if it's been called from the parallel lookup).

That's pretty much it - all that remains is to switch ->i_mutex
to rwsem and have lookup_slow() take it shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 94bdd655 15-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

parallel lookups machinery, part 3

We will need to be able to check if there is an in-lookup
dentry with matching parent/name. Right now it's impossible,
but as soon as start locking directories shared such beasts
will appear.

Add a secondary hash for locating those. Hash chains go through
the same space where d_alias will be once it's not in-lookup anymore.
Search is done under the same bitlock we use for modifications -
with the primary hash we can rely on d_rehash() into the wrong
chain being the worst that could happen, but here the pointers are
buggered once it's removed from the chain. On the other hand,
the chains are not going to be long and normally we'll end up
adding to the chain anyway. That allows us to avoid bothering with
->d_lock when doing the comparisons - everything is stable until
removed from chain.

New helper: d_alloc_parallel(). Right now it allocates, verifies
that no hashed and in-lookup matches exist and adds to in-lookup
hash.

Returns ERR_PTR() for error, hashed match (in the unlikely case it's
been found) or new dentry. In-lookup matches trigger BUG() for
now; that will change in the next commit when we introduce waiting
for ongoing lookup to finish. Note that in-lookup matches won't be
possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().

Again, these commits are separated only for making it easier to
review. All this machinery will start doing something useful only
when we go for shared locking; it's just that the combination is
too large for my taste.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 85c7f810 14-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

beginning of transition to parallel lookups - marking in-lookup dentries

marked as such when (would be) parallel lookup is about to pass them
to actual ->lookup(); unmarked when
* __d_add() is about to make it hashed, positive or not.
* __d_move() (from d_splice_alias(), directly or via
__d_unalias()) puts a preexisting dentry in its place
* in caller of ->lookup() if it has escaped all of the
above. Bug (WARN_ON, actually) if it reaches the final dput()
or d_instantiate() while still marked such.

As the result, we are guaranteed that for as long as the flag is
set, dentry will
* remain negative unhashed with positive refcount
* never have its ->d_alias looked at
* never have its ->d_lru looked at
* never have its ->d_parent and ->d_name changed

Right now we have at most one such for any given parent directory.
With parallel lookups that restriction will weaken to
* only exist when parent is locked shared
* at most one with given (parent,name) pair (comparison of
names is according to ->d_compare())
* only exist when there's no hashed dentry with the same
(parent,name)

Transition will take the next several commits; unfortunately, we'll
only be able to switch to rwsem at the end of this series. The
reason for not making it a single patch is to simplify review.

New primitives: d_in_lookup() (a predicate checking if dentry is in
the in-lookup state) and d_lookup_done() (tells the system that
we are done with lookup and if it's still marked as in-lookup, it
should cease to be such).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d101a125 26-Mar-2016 Miklos Szeredi <miklos@szeredi.hu>

fs: add file_dentry()

This series fixes bugs in nfs and ext4 due to 4bacc9c9234c ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

Regular files opened on overlayfs will result in the file being opened on
the underlying filesystem, while f_path points to the overlayfs
mount/dentry.

This confuses filesystems which get the dentry from struct file and assume
it's theirs.

Add a new helper, file_dentry() [*], to get the filesystem's own dentry
from the file. This checks file->f_path.dentry->d_flags against
DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
set (this is the common, non-overlayfs case).

In the uncommon case it will call into overlayfs's ->d_real() to get the
underlying dentry, matching file_inode(file).

The reason we need to check against the inode is that if the file is copied
up while being open, d_real() would return the upper dentry, while the open
file comes from the lower dentry.

[*] If possible, it's better simply to use file_inode() instead.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: <stable@vger.kernel.org> # v4.2
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Axtens <dja@axtens.net>


# 0b81d077 15-May-2015 Jaegeuk Kim <jaegeuk@kernel.org>

fs crypto: move per-file encryption from f2fs tree to fs/crypto

This patch adds the renamed functions moved from the f2fs crypto files.

1. definitions for per-file encryption used by ext4 and f2fs.

2. crypto.c for encrypt/decrypt functions
a. IO preparation:
- fscrypt_get_ctx / fscrypt_release_ctx
b. before IOs:
- fscrypt_encrypt_page
- fscrypt_decrypt_page
- fscrypt_zeroout_range
c. after IOs:
- fscrypt_decrypt_bio_pages
- fscrypt_pullback_bio_page
- fscrypt_restore_control_page

3. policy.c supporting context management.
a. For ioctls:
- fscrypt_process_policy
- fscrypt_get_policy
b. For context permission
- fscrypt_has_permitted_context
- fscrypt_inherit_context

4. keyinfo.c to handle permissions
- fscrypt_get_encryption_info
- fscrypt_free_encryption_info

5. fname.c to support filename encryption
a. general wrapper functions
- fscrypt_fname_disk_to_usr
- fscrypt_fname_usr_to_disk
- fscrypt_setup_filename
- fscrypt_free_filename

b. specific filename handling functions
- fscrypt_fname_alloc_buffer
- fscrypt_fname_free_buffer

6. Makefile and Kconfig

Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>


# 34d0d19d 08-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

uninline d_add()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 668d0cd5 07-Mar-2016 Al Viro <viro@zeniv.linux.org.uk>

replace d_add_unique() with saner primitive

new primitive: d_exact_alias(dentry, inode). If there is an unhashed
dentry with the same name/parent and given inode, rehash, grab and
return it. Otherwise, return NULL. The only caller of d_add_unique()
switched to d_exact_alias() + d_splice_alias().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a528aca7 28-Feb-2016 Al Viro <viro@zeniv.linux.org.uk>

use ->d_seq to get coherency between ->d_inode and ->d_flags

Games with ordering and barriers are way too brittle. Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2bd03e49 14-Jan-2016 Andrew Morton <akpm@linux-foundation.org>

include/linux/dcache.h: remove semicolons from HASH_LEN_DECLARE

A little cleanup - the invocation site provdes the semicolon.

Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8db14860 17-Jul-2015 Nicolas Iooss <nicolas.iooss_linux@m4x.org>

include, lib: add __printf attributes to several function prototypes

Using __printf attributes helps to detect several format string issues
at compile time (even though -Wformat-security is currently disabled in
Makefile). For example it can detect when formatting a pointer as a
number, like the issue fixed in commit a3fa71c40f18 ("wl18xx: show
rx_frames_per_rates as an array as it really is"), or when the arguments
do not match the format string, c.f. for example commit 5ce1aca81435
("reiserfs: fix __RASSERT format string").

To prevent similar bugs in the future, add a __printf attribute to every
function prototype which needs one in include/linux/ and lib/. These
functions were mostly found by using gcc's -Wsuggest-attribute=format
flag.

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# dc3f4198 18-May-2015 Al Viro <viro@zeniv.linux.org.uk>

make simple_positive() public

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4bacc9c9 18-Jun-2015 David Howells <dhowells@redhat.com>

overlayfs: Make f_path always point to the overlay and f_inode to the underlay

Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).

Using my union testsuite to set things up, before the patch I see:

[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 13381 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 13381 Links: 1
...

After the patch:

[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 40346 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 40346 Links: 1
...

Note the change in where /proc/$$/fd/5 points to in the ls command. It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).

The inode accessed, however, is the lower layer. The union layer is on device
25h/37d and the upper layer on 24h/36d.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4bf46a27 05-Mar-2015 David Howells <dhowells@redhat.com>

VFS: Impose ordering on accesses of d_inode and d_flags

Impose ordering on accesses of d_inode and d_flags to avoid the need to do
this:

if (!dentry->d_inode || d_is_negative(dentry)) {

when this:

if (d_is_negative(dentry)) {

should suffice.

This check is especially problematic if a dentry can have its type field set
to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in
unionmount).

What we really need to do is stick a write barrier between setting d_inode and
setting d_flags and a read barrier between reading d_flags and reading
d_inode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 525d27b2 11-Feb-2015 David Howells <dhowells@redhat.com>

VFS: Add owner-filesystem positive/negative dentry checks

Supply two functions to test whether a filesystem's own dentries are positive
or negative (d_really_is_positive() and d_really_is_negative()).

The problem is that the DCACHE_ENTRY_TYPE field of dentry->d_flags may be
overridden by the union part of a layered filesystem and isn't thus
necessarily indicative of the type of dentry.

Normally, this would involve a negative dentry (ie. ->d_inode == NULL) having
->d_layer.lower pointed to a lower layer dentry, DCACHE_PINNING_LOWER set and
the DCACHE_ENTRY_TYPE field set to something other than DCACHE_MISS_TYPE - but
it could also involve, say, a DCACHE_SPECIAL_TYPE being overridden to
DCACHE_WHITEOUT_TYPE if a 0,0 chardev is detected in the top layer.

However, inside a filesystem, when that fs is looking at its own dentries, it
probably wants to know if they are really negative or not - and doesn't care
about the fallthrough bits used by the union.

To this end, a filesystem should normally use d_really_is_positive/negative()
when looking at its own dentries rather than d_is_positive/negative() and
should use d_inode() to get at the inode.

Anyone looking at someone else's dentries (this includes pathwalk) should use
d_is_xxx() and d_backing_inode().

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 44bdb5e5 28-Jan-2015 David Howells <dhowells@redhat.com>

VFS: Split DCACHE_FILE_TYPE into regular and special types

Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular
files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and
socket files).

d_is_reg() and d_is_special() are added to detect these subtypes and
d_is_file() is left as the union of the two.

This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to
use d_is_reg(dentry) instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# df1a085a 28-Jan-2015 David Howells <dhowells@redhat.com>

VFS: Add a fallthrough flag for marking virtual dentries

Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is
a virtual dentry that covers another one in a lower layer that should be used
instead. This may be recorded on medium if directory integration is stored
there.

The flag can be set with d_set_fallthru() and tested with d_is_fallthru().

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7f7d225 28-Jan-2015 David Howells <dhowells@redhat.com>

VFS: Add a whiteout dentry type

Add DCACHE_WHITEOUT_TYPE and provide a d_is_whiteout() accessor function. A
d_is_miss() accessor is also added for ordinary cache misses and
d_is_negative() is modified to indicate either an ordinary miss or an enforced
miss (whiteout).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 155e35d4 28-Jan-2015 David Howells <dhowells@redhat.com>

VFS: Introduce inode-getting helpers for layered/unioned fs environments

Introduce some function for getting the inode (and also the dentry) in an
environment where layered/unioned filesystems are in operation.

The problem is that we have places where we need *both* the union dentry and
the lower source or workspace inode or dentry available, but we can only have
a handle on one of them. Therefore we need to derive the handle to the other
from that.

The idea is to introduce an extra field in struct dentry that allows the union
dentry to refer to and pin the lower dentry.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d6cb125b 24-Dec-2014 Al Viro <viro@zeniv.linux.org.uk>

kill d_validate()

no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 41d28bca 12-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

switch d_materialise_unique() users to d_splice_alias()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b5ae6b15 12-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

merge d_materialise_unique() into d_splice_alias()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 946e51f2 26-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

move d_rcu from overlapping d_child to overlapping d_alias

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7b600f2a 12-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

don't need that forward declaration of struct nameidata in dcache.h anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 810bb172 11-Oct-2014 Al Viro <viro@zeniv.linux.org.uk>

take dname_external() into fs/dcache.c

never used outside and it's too low-level for legitimate uses outside
of fs/dcache.c anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5542aa2f 13-Feb-2014 Eric W. Biederman <ebiederm@xmission.com>

vfs: Make d_invalidate return void

Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1ffe46d1 13-Feb-2014 Eric W. Biederman <ebiederm@xmission.com>

vfs: Merge check_submounts_and_drop and d_invalidate

Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9226b5b4 14-Sep-2014 Linus Torvalds <torvalds@linux-foundation.org>

vfs: avoid non-forwarding large load after small store in path lookup

The performance regression that Josef Bacik reported in the pathname
lookup (see commit 99d263d4c5b2 "vfs: fix bad hashing of dentries") made
me look at performance stability of the dcache code, just to verify that
the problem was actually fixed. That turned up a few other problems in
this area.

There are a few cases where we exit RCU lookup mode and go to the slow
serializing case when we shouldn't, Al has fixed those and they'll come
in with the next VFS pull.

But my performance verification also shows that link_path_walk() turns
out to have a very unfortunate 32-bit store of the length and hash of
the name we look up, followed by a 64-bit read of the combined hash_len
field. That screws up the processor store to load forwarding, causing
an unnecessary hickup in this critical routine.

It's caused by the ugly calling convention for the "hash_name()"
function, and easily fixed by just making hash_name() fill in the whole
'struct qstr' rather than passing it a pointer to just the hash value.

With that, the profile for this function looks much smoother.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1a0a397e 14-Feb-2014 J. Bruce Fields <bfields@redhat.com>

dcache: d_obtain_alias callers don't all want DISCONNECTED

There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.

This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.

It isn't really a serious problem, but it would really be clearer if we
reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting
default subvol", and this replaces that workaround.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 41edf278 01-May-2014 Al Viro <viro@zeniv.linux.org.uk>

dentry_kill(): don't try to remove from shrink list

If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual. If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0). Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# da1ce067 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

vfs: add cross-rename

If flags contain RENAME_EXCHANGE then exchange source and destination files.
There's no restriction on the type of the files; e.g. a directory can be
exchanged with a symlink.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>


# 44b1d530 01-Apr-2014 Miklos Szeredi <mszeredi@suse.cz>

vfs: add d_is_dir()

Add d_is_dir(dentry) helper which is analogous to S_ISDIR().

To avoid confusion, rename d_is_directory() to d_can_lookup().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>


# a5c21dce 12-Dec-2013 Will Deacon <will@kernel.org>

dcache: allow word-at-a-time name hashing with big-endian CPUs

When explicitly hashing the end of a string with the word-at-a-time
interface, we have to be careful which end of the word we pick up.

On big-endian CPUs, the upper-bits will contain the data we're after, so
ensure we generate our masks accordingly (and avoid hashing whatever
random junk may have been sitting after the string).

This patch adds a new dcache helper, bytemask_from_count, which creates
a mask appropriate for the CPU endianness.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b18825a7 12-Sep-2013 David Howells <dhowells@redhat.com>

VFS: Put a small type field into struct dentry::d_flags

Put a type field into struct dentry::d_flags to indicate if the dentry is one
of the following types that relate particularly to pathwalk:

Miss (negative dentry)
Directory
"Automount" directory (defective - no i_op->lookup())
Symlink
Other (regular, socket, fifo, device)

The type field is set to one of the first five types on a dentry by calls to
__d_instantiate() and d_obtain_alias() from information in the inode (if one is
given).

The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
dentry as a negative dentry.

Accessors provided are:

d_set_type(dentry, type)
d_is_directory(dentry)
d_is_autodir(dentry)
d_is_symlink(dentry)
d_is_file(dentry)
d_is_negative(dentry)
d_is_positive(dentry)

A bunch of checks in pathname resolution switched to those.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b70a80e7 01-Oct-2013 Miklos Szeredi <mszeredi@suse.cz>

vfs: introduce d_instantiate_no_diralias()

...which just returns -EBUSY if a directory alias would be created.

This is to be used by fuse mkdir to make sure that a buggy or malicious
userspace filesystem doesn't do anything nasty. Previously fuse used a
private mutex for this purpose, which can now go away.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 55f841ce 27-Aug-2013 Glauber Costa <glommer@openvz.org>

super: fix calculation of shrinkable objects for small numbers

The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least more
than 100), because the aproximation errors will be negligible. But if
this is not the case, specially when total_objects < 100, we may end up
concluding that we have no objects at all (total / 100 = 0, if total <
100).

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3942c07c 27-Aug-2013 Glauber Costa <glommer@openvz.org>

fs: bump inode and dentry counters to long

This series reworks our current object cache shrinking infrastructure in
two main ways:

* Noticing that a lot of users copy and paste their own version of LRU
lists for objects, we put some effort in providing a generic version.
It is modeled after the filesystem users: dentries, inodes, and xfs
(for various tasks), but we expect that other users could benefit in
the near future with little or no modification. Let us know if you
have any issues.

* The underlying list_lru being proposed automatically and
transparently keeps the elements in per-node lists, and is able to
manipulate the node lists individually. Given this infrastructure, we
are able to modify the up-to-now hammer called shrink_slab to proceed
with node-reclaim instead of always searching memory from all over like
it has been doing.

Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock. The locks usually disappear from the profilers with this
change.

Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.

With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim. Historically,
those two pieces of work have been posted together. This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested. You can see more about the
history of such work at http://lwn.net/Articles/552769/

Dave Chinner (18):
dcache: convert dentry_stat.nr_unused to per-cpu counters
dentry: move to per-sb LRU locks
dcache: remove dentries from LRU before putting on dispose list
mm: new shrinker API
shrinker: convert superblock shrinkers to new API
list: add a new LRU list type
inode: convert inode lru list to generic lru list code.
dcache: convert to use new lru list infrastructure
list_lru: per-node list infrastructure
shrinker: add node awareness
fs: convert inode and dentry shrinking to be node aware
xfs: convert buftarg LRU to generic code
xfs: rework buffer dispose list tracking
xfs: convert dquot cache lru to list_lru
fs: convert fs shrinkers to new scan/count API
drivers: convert shrinkers to new count/scan API
shrinker: convert remaining shrinkers to count/scan API
shrinker: Kill old ->shrink API.

Glauber Costa (7):
fs: bump inode and dentry counters to long
super: fix calculation of shrinkable objects for small numbers
list_lru: per-node API
vmscan: per-node deferred work
i915: bail out earlier when shrinker cannot acquire mutex
hugepage: convert huge zero page shrinker to new shrinker API
list_lru: dynamically adjust node arrays

This patch:

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc. This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset. So we believe it is time for a change. This
patch just moves int to longs. Machines where it matters should have a
big long anyway.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8aab6a27 08-Sep-2013 Linus Torvalds <torvalds@linux-foundation.org>

vfs: reorganize dput() memory accesses

This is me being a bit OCD after all the dentry optimization work this
merge window: profiles end up showing 'dput()' as a rather expensive
operation, and there were two unrelated bad reasons for that.

The first reason was reading d_lockref.count for debugging purposes,
which touches the lockref cacheline (for reads) before really need to.
More importantly, the debugging test in question is _wrong_, and has
hidden bugs. It's true that we can only sleep when the count goes down
to zero, but the test as-is hides the much more subtle bug that happens
if we race with somebody else deleting the file.

Anyway we _will_ touch that cacheline, but let's do it for a write and
in the right routine (ie in "lockref_put_or_lock()") which annotates the
costs better. So remove the misleading debug code.

The other was an unnecessary access to the cacheline that contains the
d_lru list, just to check whether we already were on the LRU list or
not. This is exactly what we have d_flags for, so that we can avoid
touching extra cache lines for the common case. So just add another bit
for "is this dentry on the LRU".

Finally, mark the tests properly likely/unlikely, so that the common
fast-paths are dense in the instruction stream.

This makes the profiles look much saner.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f0d3b3de 04-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

constify dcache.c inlined helpers where possible

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 848ac114 05-Sep-2013 Miklos Szeredi <mszeredi@suse.cz>

vfs: check submounts and drop atomically

We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount.

Process A: have_submounts() -> returns false
Process B: mount() -> success
Process A: d_drop()

This patch prepares the ground for the fix by doing the following
operations all under the same rename lock:

have_submounts()
shrink_dcache_parent()
d_drop()

This is actually an optimization since have_submounts() and
shrink_dcache_parent() both traverse the same dentry tree separately.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: David Howells <dhowells@redhat.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 15570086 02-Sep-2013 Linus Torvalds <torvalds@linux-foundation.org>

vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()

This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
and re-implements it using the lockref infrastructure instead. It also
adds a lot of comments about what is actually going on, because turning
a dentry that was looked up using RCU into a long-lived reference
counted entry is one of the more subtle parts of the rcu walk.

We also used to be _particularly_ subtle in unlazy_walk() where we
re-validate both the dentry and its parent using the same sequence
count. We used to do it by nesting the locks and then verifying the
sequence count just once.

That was silly, because nested locking is expensive, but the sequence
count check is not. So this just re-validates the dentry and the parent
separately, avoiding the nested locking, and making the lockref lookup
possible.

Acked-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 98474236 28-Aug-2013 Waiman Long <Waiman.Long@hp.com>

vfs: make the dentry cache use the lockref infrastructure

This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.

There are no semantic changes here, it's purely syntactic. The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.

This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.

[ As with the previous commit, this is a rewritten version of a concept
originally from Waiman, so credit goes to him, blame for any errors
goes to me.

Waiman's patch had some semantic differences for taking advantage of
the lockless update in dget_parent(), while this patch is
intentionally a pure search-and-replace change with no semantic
changes. - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 118b2302 23-Aug-2013 Al Viro <viro@zeniv.linux.org.uk>

cope with potentially long ->d_dname() output for shmem/hugetlb

dynamic_dname() is both too much and too little for those - the
output may be well in excess of 64 bytes dynamic_dname() assumes
to be enough (thanks to ashmem feeding really long names to
shmem_file_setup()) and vsnprintf() is an overkill for those
guys.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 24924a20 18-Jul-2013 Peng Tao <bergwolf@gmail.com>

vfs: constify dentry parameter in d_count()

so that it can be used in places like d_compare/d_hash
without causing a compiler warning.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 84d08fa8 05-Jul-2013 Al Viro <viro@zeniv.linux.org.uk>

helper for reading ->d_count

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# da53be12 21-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

Don't pass inode to ->d_hash() and ->d_compare()

Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 60545d0d 06-Jun-2013 Al Viro <viro@zeniv.linux.org.uk>

[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ecf3d1f1 20-Feb-2013 Jeff Layton <jlayton@kernel.org>

vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op

The following set of operations on a NFS client and server will cause

server# mkdir a
client# cd a
server# mv a a.bak
client# sleep 30 # (or whatever the dir attrcache timeout is)
client# stat .
stat: cannot stat `.': Stale NFS file handle

Obviously, we should not be getting an ESTALE error back there since the
inode still exists on the server. The problem is that the lookup code
will call d_revalidate on the dentry that "." refers to, because NFS has
FS_REVAL_DOT set.

nfs_lookup_revalidate will see that the parent directory has changed and
will try to reverify the dentry by redoing a LOOKUP. That of course
fails, so the lookup code returns ESTALE.

The problem here is that d_revalidate is really a bad fit for this case.
What we really want to know at this point is whether the inode is still
good or not, but we don't really care what name it goes by or whether
the dcache is still valid.

Add a new d_op->d_weak_revalidate operation and have complete_walk call
that instead of d_revalidate. The intent there is to allow for a
"weaker" d_revalidate that just checks to see whether the inode is still
good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
special casing.

[AV: changed method name, added note in porting, fixed confusion re
having it possibly called from RCU mode (it won't be)]

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# da2d8455 24-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

constify d_lookup() arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a713ca2a 24-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

constify __d_lookup() arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ad8ca374 14-Jan-2013 Jeff Layton <jlayton@kernel.org>

vfs: remove d_path_with_unreachable

The last caller was removed >2 years ago in commit 7b2a69ba7.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 39e3c955 28-Nov-2012 Jeff Layton <jlayton@kernel.org>

vfs: remove DCACHE_NEED_LOOKUP

The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.

Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b161dfa6 17-Sep-2012 Miklos Szeredi <mszeredi@suse.cz>

vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()

IBM reported a soft lockup after applying the fix for the rename_lock
deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression
in kernel 2.6.38") was found to be the culprit.

The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
dentry was killed. This flag can be set on non-killed dentries too,
which results in infinite retries when trying to traverse the dentry
tree.

This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
only set in d_kill() and makes try_to_ascend() test only this flag.

IBM reported successful test results with this patch.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b728e19 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

stop passing nameidata * to ->d_revalidate()

Just the lookup flags. Die, bastard, die...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b3d9b7a3 09-Jun-2012 Al Viro <viro@zeniv.linux.org.uk>

vfs: switch i_dentry/d_alias to hlist

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 26fe5750 10-May-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: make it possible to access the dentry hash/len as one 64-bit entry

This allows comparing hash and len in one operation on 64-bit
architectures. Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.

The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer. This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 12f8ad4b 04-May-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces

The calling conventions for __d_lookup_rcu() and dentry_cmp() are
annoying in different ways, and there is actually one single underlying
reason for both of the annoyances.

The fundamental reason is that we do the returned dentry sequence number
check inside __d_lookup_rcu() instead of doing it in the caller. This
results in two annoyances:

- __d_lookup_rcu() now not only needs to return the dentry and the
sequence number that goes along with the lookup, it also needs to
return the inode pointer that was validated by that sequence number
check.

- and because we did the sequence number check early (to validate the
name pointer and length) we also couldn't just pass the dentry itself
to dentry_cmp(), we had to pass the counted string that contained the
name.

So that sequence number decision caused two separate ugly calling
conventions.

Both of these problems would be solved if we just did the sequence
number check in the caller instead. There's only one caller, and that
caller already has to do the sequence number check for the parent
anyway, so just do that.

That allows us to stop returning the dentry->d_inode in that in-out
argument (pointer-to-pointer-to-inode), so we can make the inode
argument just a regular input inode pointer. The caller can just load
the inode from dentry->d_inode, and then do the sequence number check
after that to make sure that it's synchronized with the name we looked
up.

And it allows us to just pass in the dentry to dentry_cmp(), which is
what all the callers really wanted. Sure, dentry_cmp() has to be a bit
careful about the dentry (which is not stable during RCU lookup), but
that's actually very simple.

And now that dentry_cmp() can clearly see that the first string argument
is a dentry, we can use the direct word access for that, instead of the
careful unaligned zero-padding. The dentry name is always properly
aligned, since it is a single path component that is either embedded
into the dentry itself, or was allocated with kmalloc() (see __d_alloc).

Finally, this also uninlines the nasty slow-case for dentry comparisons:
that one *does* need to do a sequence number check, since it will call
in to the low-level filesystems, and we want to give those a stable
inode pointer and path component length/start arguments. Doing an extra
sequence check for that slow case is not a problem, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32991ab3 12-Feb-2012 Al Viro <viro@zeniv.linux.org.uk>

vfs: d_alloc_root() gone

all callers converted to d_make_root() by now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5483f18e 04-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.c

It's only used inside fs/dcache.c, and we're going to play games with it
for the word-at-a-time patches. This time we really don't even want to
export it, because it really is an internal function to fs/dcache.c, and
has been since it was introduced.

Having it in that extremely hot header file (it's included in pretty
much everything, thanks to <linux/fs.h>) is a disaster for testing
different versions, and is utterly pointless.

We really should have some kind of header file diet thing, where we
figure out which parts of header files are really better off private and
only result in more expensive compiles.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5707c87f 02-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: clarify and clean up dentry_cmp()

It did some odd things for unclear reasons. As this is one of the
functions that gets changed when doing word-at-a-time compares, this is
yet another of the "don't change any semantics, but clean things up so
that subsequent patches don't get obscured by the cleanups".

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0145acc2 02-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: uninline full_name_hash()

.. and also use it in lookup_one_len() rather than open-coding it.

There aren't any performance-critical users, so inlining it is silly.
But it wouldn't matter if it wasn't for the fact that the word-at-a-time
dentry name patches want to conditionally replace the function, and
uninlining it sets the stage for that.

So again, this is a preparatory patch that doesn't change any semantics,
and only prepares for a much cleaner and testable word-at-a-time dentry
name accessor patch.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8966be90 02-Mar-2012 Linus Torvalds <torvalds@linux-foundation.org>

vfs: trivial __d_lookup_rcu() cleanups

These don't change any semantics, but they clean up the code a bit and
mark some arguments appropriately 'const'.

They came up as I was doing the word-at-a-time dcache name accessor
code, and cleaning this up now allows me to send out a smaller relevant
interesting patch for the experimental stuff.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 46f72b34 10-Jan-2012 Sage Weil <sage@newdream.net>

vfs: export symbol d_find_any_alias()

Ceph needs this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>


# eaf5f907 10-Jan-2012 Miklos Szeredi <miklos@szeredi.hu>

fix shrink_dcache_parent() livelock

Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
cause shrink_dcache_parent() to loop forever.

Here's what appears to happen:

1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

2 - CPU1: select_parent(P) locks P->d_lock

3 - CPU0: shrink_dentry_list() locks C->d_lock
dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

4 - CPU1: select_parent(P) locks C->d_lock,
moves C from dispose list being processed on CPU0 to the new
dispose list, returns 1

5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

6 - Goto 2 with CPU0 and CPU1 switched

Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
it found a new one, causing shrink_dentry_list() to think it's making progress
and loop over and over.

One way to trigger this is to make udev calls stat() on the sysfs file while it
is going away.

Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick:

ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

Then execute the following loop:

while true; do
echo -bond0 > /sys/class/net/bonding_masters
echo +bond0 > /sys/class/net/bonding_masters
echo -bond1 > /sys/class/net/bonding_masters
echo +bond1 > /sys/class/net/bonding_masters
done

One fix would be to check all callers and prevent concurrent calls to
shrink_dcache_parent(). But I think a better solution is to stop the
stealing behavior.

This patch adds a new dentry flag that is set when the dentry is added to the
dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
new reference just before being pruned.

If the dentry has this flag, select_parent() will skip it and let
shrink_dentry_list() retry pruning it. With select_parent() skipping those
dentries there will not be the appearance of progress (new dentries found) when
there is none, hence shrink_dcache_parent() will not loop forever.

Set the flag is also set in prune_dcache_sb() for consistency as suggested by
Linus.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# adc0e91a 08-Jan-2012 Al Viro <viro@zeniv.linux.org.uk>

vfs: new helper - d_make_root()

d_alloc_root() with iput() in case of allocation failure...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 02125a82 05-Dec-2011 Al Viro <viro@zeniv.linux.org.uk>

fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API

__d_path() API is asking for trouble and in case of apparmor d_namespace_path()
getting just that. The root cause is that when __d_path() misses the root
it had been told to look for, it stores the location of the most remote ancestor
in *root. Without grabbing references. Sure, at the moment of call it had
been pinned down by what we have in *path. And if we raced with umount -l, we
could have very well stopped at vfsmount/dentry that got freed as soon as
prepend_path() dropped vfsmount_lock.

It is safe to compare these pointers with pre-existing (and known to be still
alive) vfsmount and dentry, as long as all we are asking is "is it the same
address?". Dereferencing is not safe and apparmor ended up stepping into
that. d_namespace_path() really wants to examine the place where we stopped,
even if it's not connected to our namespace. As the result, it looked
at ->d_sb->s_magic of a dentry that might've been already freed by that point.
All other callers had been careful enough to avoid that, but it's really
a bad interface - it invites that kind of trouble.

The fix is fairly straightforward, even though it's bigger than I'd like:
* prepend_path() root argument becomes const.
* __d_path() is never called with NULL/NULL root. It was a kludge
to start with. Instead, we have an explicit function - d_absolute_root().
Same as __d_path(), except that it doesn't get root passed and stops where
it stops. apparmor and tomoyo are using it.
* __d_path() returns NULL on path outside of root. The main
caller is show_mountinfo() and that's precisely what we pass root for - to
skip those outside chroot jail. Those who don't want that can (and do)
use d_path().
* __d_path() root argument becomes const. Everyone agrees, I hope.
* apparmor does *NOT* try to use __d_path() or any of its variants
when it sees that path->mnt is an internal vfsmount. In that case it's
definitely not mounted anywhere and dentry_path() is exactly what we want
there. Handling of sysctl()-triggered weirdness is moved to that place.
* if apparmor is asked to do pathname relative to chroot jail
and __d_path() tells it we it's not in that jail, the sucker just calls
d_absolute_path() instead. That's the other remaining caller of __d_path(),
BTW.
* seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
the normal seq_file logics will take care of growing the buffer and redoing
the call of ->show() just fine). However, if it gets path not reachable
from root, it returns SEQ_SKIP. The only caller adjusted (i.e. stopped
ignoring the return value as it used to do).

Reviewed-by: John Johansen <john.johansen@canonical.com>
ACKed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org


# f0023bc6 28-Oct-2011 Sage Weil <sage@newdream.net>

vfs: add d_prune dentry operation

This adds a d_prune dentry operation that is called by the VFS prior to
pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
are about to unhash and kill the dentry.

This will be used by Ceph to maintain a flag indicating whether the
complete contents of a directory are contained in the dcache, allowing it
to satisfy lookups and readdir without addition server communication.

Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
DCACHE_OP_ bits.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 830c0f0e 06-Aug-2011 Linus Torvalds <torvalds@linux-foundation.org>

vfs: renumber DCACHE_xyz flags, remove some stale ones

Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
tree.

And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 60063497 26-Jul-2011 Arun Sharma <asharma@fb.com>

atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ed75e95d 27-Jun-2011 Al Viro <viro@zeniv.linux.org.uk>

kill lookup_create()

folded into the only caller (kern_path_create())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 44396f4b 31-May-2011 Josef Bacik <josef@redhat.com>

fs: add a DCACHE_NEED_LOOKUP flag for d_flags

Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this

http://people.redhat.com/jwhiter/ls-noreada.png

To this

http://people.redhat.com/jwhiter/ls-good.png

Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dea3667b 24-Apr-2011 Linus Torvalds <torvalds@linux-foundation.org>

vfs: get rid of insane dentry hashing rules

The dentry hashing rules have been really quite complicated for a long
while, in odd ways. That made functions like __d_drop() very fragile
and non-obvious.

In particular, whether a dentry was hashed or not was indicated with an
explicit DCACHE_UNHASHED bit. That's despite the fact that the hash
abstraction that the dentries use actually have a 'is this entry hashed
or not' model (which is a simple test of the 'pprev' pointer).

The reason that was done is because we used the normal 'is this entry
unhashed' model to mark whether the dentry had _ever_ been hashed in the
dentry hash tables, and that logic goes back many years (commit
b3423415fbc2: "dcache: avoid RCU for never-hashed dentries").

That, in turn, meant that __d_drop had totally different unhashing logic
for the dentry hash table case and for the anonymous dcache case,
because in order to use the "is this dentry hashed" logic as a flag for
whether it had ever been on the RCU hash table, we had to unhash such a
dentry differently so that we'd never think that it wasn't 'unhashed'
and wouldn't be free'd correctly.

That's just insane. It made the logic really hard to follow, when there
were two different kinds of "unhashed" states, and one of them (the one
that used "list_bl_unhashed()") really had nothing at all to do with
being unhashed per se, but with a very subtle lifetime rule instead.

So turn all of it around, and make it logical.

Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether
the dentry is on the hash chains or not, use the hash chain unhashed
logic for that. Suddenly "d_unhashed()" just uses "list_bl_unhashed()",
and everything makes sense.

And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit.
If we ever insert the dentry into the dentry hash table so that it is
visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now
needs the RCU lifetime rules. Now suddently that test at dentry free
time makes sense too.

And because unhashing now is sane and doesn't depend on where the dentry
got unhashed from (because the dentry hash chain details doesn't have
some subtle side effects), we can re-unify the __d_drop() logic and use
common code for the unhashing.

Also fix one more open-coded hash chain bit_spin_lock() that I missed in
the previous chain locking cleanup commit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0f60f240 21-Mar-2011 David Howells <dhowells@redhat.com>

FS: lookup_mnt() is only used in the core fs routines now

lookup_mnt() is only used in the core fs routines now, so it doesn't need to
be globally declared anymore. It isn't exported to modules at the moment, so
nothing that can be modularised seems to be using it.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1aed3e42 18-Mar-2011 Al Viro <viro@zeniv.linux.org.uk>

lose 'mounting_here' argument in ->d_manage()

it's always false...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ab90911f 14-Jan-2011 David Howells <dhowells@redhat.com>

Allow d_manage() to be used in RCU-walk mode

Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).

autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cc53ce53 14-Jan-2011 David Howells <dhowells@redhat.com>

Add a dentry op to allow processes to be held during pathwalk transit

Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).

The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.

The ->d_manage() dentry operation:

int (*d_manage)(struct path *path, bool mounting_here);

takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.

->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.

Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.

follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).

A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.

__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.

Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.

==========================
WHAT THIS MEANS FOR AUTOFS
==========================

Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.

autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.

The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:

mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4

versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:

automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

which means that the system is deadlocked.

This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9875cf80 14-Jan-2011 David Howells <dhowells@redhat.com>

Add a dentry op to handle automounting rather than abusing follow_link()

Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).

This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

The ->d_automount() dentry operation:

struct vfsmount *(*d_automount)(struct path *mountpoint);

takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.

The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.

Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.

__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).

__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.

follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".

I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.

I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.

[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2a8c0c68 08-Jan-2011 Randy Dunlap <randy.dunlap@oracle.com>

fs: fix dcache.h kernel-doc notation

Fix new kernel-doc notation warning in dcache.h:

Warning(include/linux/dcache.h:316): Excess function parameter 'Returns' description in '__d_rcu_to_refcount'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d55c369 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: implement faster dentry memcmp

The standard memcmp function on a Westmere system shows up hot in
profiles in the `git diff` workload (both parallel and single threaded),
and it is likely due to the costs associated with trapping into
microcode, and little opportunity to improve memory access (dentry
name is not likely to take up more than a cacheline).

So replace it with an open-coded byte comparison. This increases code
size by 8 bytes in the critical __d_lookup_rcu function, but the
speedup is huge, averaging 10 runs of each:

git diff st user sys elapsed CPU
before 1.15 2.57 3.82 97.1
after 1.14 2.35 3.61 96.8

git diff mt user sys elapsed CPU
before 1.27 3.85 1.46 349
after 1.26 3.54 1.43 333

Elapsed time for single threaded git diff at 95.0% confidence:
-0.21 +/- 0.01
-5.45% +/- 0.24%

It's -0.66% +/- 0.06% elapsed time on my Opteron, so rep cmp costs on the
fam10h seem to be relatively smaller, but there is still a win.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 4b936885 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: improve scalability of pseudo filesystems

Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 873feea0 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache per-inode inode alias locking

dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# ceb5bdc2 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache per-bucket dcache hash locking

We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 34286d66 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: rcu-walk aware d_revalidate method

Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 44a7d7a8 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: cache optimise dentry and inode for rcu-walk

Put dentry and inode fields into top of data structure. This allows RCU path
traversal to perform an RCU dentry lookup in a path walk by touching only the
first 56 bytes of the dentry.

We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.

inode is also rearranged so that RCU lookup will only touch a single cacheline
in the inode, plus one in the i_ops structure.

This is important for directory component lookups in RCU path walking. In the
kernel source, directory names average is around 6 chars, so this works.

When we reach the last element of the lookup, we need to lock it and take its
refcount which requires another cacheline access.

Align dentry and inode operations structs, so members will be at predictable
offsets and we can group common operations into head of structure.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# fb045adb 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache reduce branches in lookup path

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 5f57cbcc 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache remove d_mounted

Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
The flag can be cleared by checking the hash table to see if there are any
mounts left, which is not time critical because it is performed at detach time.

The mounted state of a dentry is only used to speculatively take a look in the
mount hash table if it is set -- before following the mount, vfsmount lock is
taken and mount re-checked without races.

This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
might use later (and some configs have larger than 32-bit spinlocks which might
make use of the hole).

Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.

At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.

XXX: autofs may not work right if we have other mounts go over the top of it?

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 31e6b01f 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: rcu-walk for path lookup

Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# dc0474be 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache rationalise dget variants

dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b5c84bf6 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache remove dcache_lock

dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 949854d0 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: Use rename lock and RCU for multi-step operations

The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.

We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b23fb0a6 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: scale inode alias list

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 2fd6b7f5 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache scale subdirs

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b7ab39f6 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 789680d1 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache scale hash

Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# ec2447c2 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

hostfs: simplify locking

Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# b1e6a015 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: change d_hash for rcu-walk

Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 621e155a 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: change d_compare for rcu-walk

Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# fb2d5b86 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: name case update method

smpfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# fe15ce44 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: change d_delete semantics

Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 5eef7fa9 06-Jan-2011 Nick Piggin <npiggin@kernel.dk>

fs: dcache documentation cleanup

Remove redundant (and incorrect, since dcache RCU lookup) dentry locking
documentation and point to the canonical document.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


# 8df9d1a4 10-Aug-2010 Miklos Szeredi <mszeredi@suse.cz>

vfs: show unreachable paths in getcwd and proc

Prepend "(unreachable)" to path strings if the path is not reachable
from the current root.

Two places updated are
- the return string from getcwd()
- and symlinks under /proc/$PID.

Other uses of d_path() are left unchanged (we know that some old
software crashes if /proc/mounts is changed).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c103135c 06-Jun-2010 Al Viro <viro@zeniv.linux.org.uk>

new helper: __dentry_path()

builds path relative to fs root, called under dcache_lock,
doesn't append any nonsense to unlinked ones.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d83c49f3 30-Apr-2010 Al Viro <viro@zeniv.linux.org.uk>

Fix the regression created by "set S_DEAD on unlink()..." commit

1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.

2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.

3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f3da392e 03-May-2009 Alexey Dobriyan <adobriyan@gmail.com>

dcache: extrace and use d_unlinked()

d_unlinked() will be used in middle-term to ban checkpointing when opened
but unlinked file is detected, and in long term, to detect such situation
and special case on it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1c755af4 18-Apr-2009 Al Viro <viro@zeniv.linux.org.uk>

switch lookup_mnt()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c28f7e56 21-May-2009 Eric Paris <eparis@redhat.com>

fsnotify: parent event notification

inotify and dnotify both use a similar parent notification mechanism. We
add a generic parent notification mechanism to fsnotify for both of these
to use. This new machanism also adds the dentry flag optimization which
exists for inotify to dnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>


# e56980d4 11-Feb-2009 Jan Engelhardt <jengelh@medozas.de>

fs: make struct dentry->d_op const

This change will allow for tagging many dentry_operations const in the
source tree.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c2452f32 01-Dec-2008 Nick Piggin <npiggin@suse.de>

shrink struct dentry

struct dentry is one of the most critical structures in the kernel. So it's
sad to see it going neglected.

With CONFIG_PROFILING turned on (which is probably the common case at least
for distros and kernel developers), sizeof(struct dcache) == 208 here
(64-bit). This gives 19 objects per slab.

I packed d_mounted into a hole, and took another 4 bytes off the inline
name length to take the padding out from the end of the structure. This
shinks it to 200 bytes. I could have gone the other way and increased the
length to 40, but I'm aiming for a magic number, read on...

I then got rid of the d_cookie pointer. This shrinks it to 192 bytes. Rant:
why was this ever a good idea? The cookie system should increase its hash
size or use a tree or something if lookups are a problem. Also the "fast
dcookie lookups" in oprofile should be moved into the dcookie code -- how
can oprofile possibly care about the dcookie_mutex? It gets dropped after
get_dcookie() returns so it can't be providing any sort of protection.

At 192 bytes, 21 objects fit into a 4K page, saving about 3MB on my system
with ~140 000 entries allocated. 192 is also a multiple of 64, so we get
nice cacheline alignment on 64 and 32 byte line systems -- any given dentry
will now require 3 cachelines to touch all fields wheras previously it
would require 4.

I know the inline name size was chosen quite carefully, however with the
reduction in cacheline footprint, it should actually be just about as fast
to do a name lookup for a 36 character name as it was before the patch (and
faster for other sizes). The memory footprint savings for names which are
<= 32 or > 36 bytes long should more than make up for the memory cost for
33-36 byte names.

Performance is a feature...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e2761a11 15-Oct-2008 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

[PATCH vfs-2.6 2/6] vfs: add d_ancestor()

This adds d_ancestor() instead of d_isparent(), then use it.

If new_dentry == old_dentry, is_subdir() returns 1, looks strange.
"new_dentry == old_dentry" is not subdir obviously. But I'm not
checking callers for now, so this keeps current behavior.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>


# 9308a612 11-Aug-2008 Christoph Hellwig <hch@lst.de>

[PATCH] kill d_alloc_anon

Remove d_alloc_anon now that no users are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 4ea3ada2 11-Aug-2008 Christoph Hellwig <hch@lst.de>

[PATCH] new helper: d_obtain_alias

The calling conventions of d_alloc_anon are rather unfortunate for all
users, and it's name is not very descriptive either.

Add d_obtain_alias as a new exported helper that drops the inode
reference in the failure case, too and allows to pass-through NULL
pointers and inodes to allow for tail-calls in the export operations.

Incidentally this helper already existed as a private function in
libfs.c as exportfs_d_alloc so kill that one and switch the callers
to d_obtain_alias.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e45b590b 07-Aug-2008 Christoph Hellwig <hch@lst.de>

[PATCH] change d_add_ci argument ordering

As pointed out during review d_add_ci argument order should match d_add,
so switch the dentry and inode arguments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 9403540c 21-May-2008 Barry Naujok <bnaujok@sgi.com>

dcache: Add case-insensitive support d_ci_add() routine

This add a dcache entry to the dcache for lookup, but changing the name
that is associated with the entry rather than the one passed in to the
lookup routine.

First, it sees if the case-exact match already exists in the dcache and
uses it if one exists. Otherwise, it allocates a new node with the new
name and splices it into the dcache.

Original code from ntfs_lookup in fs/ntfs/namei.c by Anton Altaparmakov.

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Christoph Hellwig <hch@infradead.org>


# 20d4fdc1 09-Jun-2008 Jan Engelhardt <jengelh@medozas.de>

[patch 2/4] fs: make struct file arg to d_path const

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 82524746 12-May-2008 Franck Bui-Huu <fbuihuu@gmail.com>

rcu: split list.h and move rcu-protected lists into rculist.h

Move rcu-protected lists from list.h into a new header file rculist.h.

This is done because list are a very used primitive structure all over the
kernel and it's currently impossible to include other header files in this
list.h without creating some circular dependencies.

For example, list.h implements rcu-protected list and uses rcu_dereference()
without including rcupdate.h. It actually compiles because users of
rcu_dereference() are macros. Others RCU functions could be used too but
aren't probably because of this.

Therefore this patch creates rculist.h which includes rcupdates without to
many changes/troubles.

Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 735643ee 30-Apr-2008 Robert P. J. Day <rpjday@crashcourse.ca>

Remove "#ifdef __KERNEL__" checks from unexported headers

Remove the "#ifdef __KERNEL__" tests from unexported header files in
linux/include whose entire contents are wrapped in that preprocessor
test.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9d1bc601 27-Mar-2008 Miklos Szeredi <mszeredi@suse.cz>

[patch 2/7] vfs: mountinfo: add seq_file_root()

Add a new function:

seq_file_root()

This is similar to seq_path(), but calculates the path relative to the
given root, instead of current->fs->root. If the path was unreachable
from root, then modify the root parameter to reflect this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6092d048 27-Mar-2008 Ram Pai <linuxram@us.ibm.com>

[patch 1/7] vfs: mountinfo: add dentry_path()

[mszeredi@suse.cz] split big patch into managable chunks

Add the following functions:

dentry_path()
seq_dentry()

These are similar to d_path() and seq_path(). But instead of
calculating the path within a mount namespace, they calculate the path
from the root of the filesystem to a given dentry, ignoring mounts
completely.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6d59e7f5 22-Mar-2008 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] move a bunch of declarations to fs/internal.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# cf28b486 14-Feb-2008 Jan Blunck <jblunck@suse.de>

d_path: Make d_path() use a struct path

d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 74c3cbe3 22-Jul-2007 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] audit: watching subtrees

New kind of audit rule predicates: "object is visible in given subtree".
The part that can be sanely implemented, that is. Limitations:
* if you have hardlink from outside of tree, you'd better watch
it too (or just watch the object itself, obviously)
* if you mount something under a watched tree, tell audit
that new chunk should be added to watched subtrees
* if you umount something in a watched tree and it's still mounted
elsewhere, you will get matches on events happening there. New command
tells audit to recalculate the trees, trimming such sources of false
positives.

Note that it's _not_ about path - if something mounted in several places
(multiple mount, bindings, different namespaces, etc.), the match does
_not_ depend on which one we are using for access.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c23fbb6b 08-May-2007 Eric Dumazet <dada1@cosmosbay.com>

VFS: delay the dentry name generation on sockets and pipes

1) Introduces a new method in 'struct dentry_operations'. This method
called d_dname() might be called from d_path() to build a pathname for
special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all
pipes/sockets) may need to change prototype of this method, but we now
use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);

2) Adds a dynamic_dname() helper function that eases d_dname() implementations

3) Defines d_dname method for sockets : No more sprintf() at socket
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...

4) Defines d_dname method for pipes : No more sprintf() at pipe
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c636ebdb 11-Oct-2006 David Howells <dhowells@redhat.com>

[PATCH] VFS: Destroy the dentries contributed by a superblock on unmounting

The attached patch destroys all the dentries attached to a superblock in one go
by:

(1) Destroying the tree rooted at s_root.

(2) Destroying every entry in the anon list, one at a time.

(3) Each entry in the anon list has its subtree consumed from the leaves
inwards.

This reduces the amount of work generic_shutdown_super() does, and avoids
iterating through the dentry_unused list.

Note that locking is almost entirely absent in the shrink_dcache_for_umount*()
functions added by this patch. This is because:

(1) at the point the filesystem calls generic_shutdown_super(), it is not
permitted to further touch the superblock's set of dentries, and nor may
it remove aliases from inodes;

(2) the dcache memory shrinker now skips dentries that are being unmounted;
and

(3) the superblock no longer has any external references through which the VFS
can reach it.

Given these points, the only locking we need to do is when we remove dentries
from the unused list and the name hashes, which we do a directory's worth at a
time.

We also don't need to guard against reference counts going to zero unexpectedly
and removing bits of the tree we're working on as nothing else can call dput().

A cut down version of dentry_iput() has been folded into
shrink_dcache_for_umount_subtree() function. Apart from not needing to unlock
things, it also doesn't need to check for inotify watches.

In this version of the patch, the complaint about a dentry still being in use
has been expanded from a single BUG_ON() and now gives much more information.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 770bfad8 22-Aug-2006 David Howells <dhowells@redhat.com>

NFS: Add dentry materialisation op

The attached patch adds a new directory cache management function that prepares
a disconnected anonymous function to be connected into the dentry tree. The
anonymous dentry is transferred the name and parentage from another dentry.

The following changes were made in [try #2]:

(*) d_materialise_dentry() now switches the parentage of the two nodes around
correctly when one or other of them is self-referential.

The following changes were made in [try #7]:

(*) d_instantiate_unique() has had the interior part split out as function
__d_instantiate_unique(). Callers of this latter function must be holding
the appropriate locks.

(*) _d_rehash() has been added as a wrapper around __d_rehash() to call it
with the most obvious hash list (the one from the name). d_rehash() now
calls _d_rehash().

(*) d_materialise_dentry() is now __d_materialise_dentry() and is static.

(*) d_materialise_unique() added to perform the combination of d_find_alias(),
d_materialise_dentry() and d_add_unique() that the NFS client was doing
twice, all within a single dcache_lock critical section. This reduces the
number of times two different spinlocks were being accessed.

The following further changes were made:

(*) Add the dentries onto their parents d_subdirs lists.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>


# a90b9c05 03-Jul-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] lockdep: annotate dcache

Teach special (recursive) locking code to the lock validator. Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 454e2398 23-Jun-2006 David Howells <dhowells@redhat.com>

[PATCH] VFS: Permit filesystem to override root dentry on mount

Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.

(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().

(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().

This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.

However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.

[*] Anonymous until discovered from another tree.

(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0feae5c4 22-Jun-2006 NeilBrown <neilb@suse.de>

[PATCH] Fix dcache race during umount

The race is that the shrink_dcache_memory shrinker could get called while a
filesystem is being unmounted, and could try to prune a dentry belonging to
that filesystem.

If it does, then it will call in to iput on the inode while the dentry is
no longer able to be found by the umounting process. If iput takes a
while, generic_shutdown_super could get all the way though
shrink_dcache_parent and shrink_dcache_anon and invalidate_inodes without
ever waiting on this particular inode.

Eventually the superblock gets freed anyway and if the iput tried to touch
it (which some filesystems certainly do), it will lose. The promised
"Self-destruct in 5 seconds" doesn't lead to a nice day.

The race is closed by holding s_umount while calling prune_one_dentry on
someone else's dentry. As a down_read_trylock is used,
shrink_dcache_memory will no longer try to prune the dentry of a filesystem
that is being unmounted, and unmount will not be able to start until any
such active prune_one_dentry completes.

This requires that prune_dcache *knows* which filesystem (if any) it is
doing the prune on behalf of so that it can be careful of other
filesystems. shrink_dcache_memory isn't called it on behalf of any
filesystem, and so is careful of everything.

shrink_dcache_anon is now passed a super_block rather than the s_anon list
out of the superblock, so it can get the s_anon list itself, and can pass
the superblock down to prune_dcache.

If prune_dcache finds a dentry that it cannot free, it leaves it where it
is (at the tail of the list) and exits, on the assumption that some other
thread will be removing that dentry soon. To try to make sure that some
work gets done, a limited number of dnetries which are untouchable are
skipped over while choosing the dentry to work on.

I believe this race was first found by Kirill Korotaev.

Cc: Jan Blunck <jblunck@suse.de>
Acked-by: Kirill Korotaev <dev@openvz.org>
Cc: Olaf Hering <olh@suse.de>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3e7e241f 31-Mar-2006 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] dcache: Add helper d_hash_and_lookup

It is very common to hash a dentry and then to call lookup. If we take fs
specific hash functions into account the full hash logic can get ugly.
Further full_name_hash as an inline function is almost 100 bytes on x86 so
having a non-inline choice in some cases can measurably decrease code size.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# c32ccd87 25-Mar-2006 Nick Piggin <nickpiggin@yahoo.com.au>

[PATCH] inotify: lock avoidance with parent watch status in dentry

Previous inotify work avoidance is good when inotify is completely unused,
but it breaks down if even a single watch is in place anywhere in the
system. Robin Holt notices that udev is one such culprit - it slows down a
512-thread application on a 512 CPU system from 6 seconds to 22 minutes.

Solve this by adding a flag in the dentry that tells inotify whether or not
its parent inode has a watch on it. Event queueing to parent will skip
taking locks if this flag is cleared. Setting and clearing of this flag on
all child dentries versus event delivery: this is no in terms of race
cases, and that was shown to be equivalent to always performing the check.

The essential behaviour is that activity occuring _after_ a watch has been
added and _before_ it has been removed, will generate events.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1b862354 14-Dec-2005 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] remove bogus asm/bug.h includes.

A bunch of asm/bug.h includes are both not needed (since it will get
pulled anyway) and bogus (since they are done too early). Removed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 47ba87e0 03-Feb-2006 Marcelo Tosatti <marcelo.tosatti@cyclades.com>

[PATCH] make "struct d_cookie" depend on CONFIG_PROFILING

Shrinks "struct dentry" from 128 bytes to 124 on x86, allowing 31 objects
per slab instead of 30.

Cc: John Levon <levon@movementarian.org>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5160ee6f 08-Jan-2006 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] shrink dentry struct

Some long time ago, dentry struct was carefully tuned so that on 32 bits
UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
of memory cache lines.

Then RCU was added and dentry struct enlarged by two pointers, with nice
results for SMP, but not so good on UP, because breaking the above tuning
(128 + 8 = 136 bytes)

This patch reverts this unwanted side effect, by using an union (d_u),
where d_rcu and d_child are placed so that these two fields can share their
memory needs.

At the time d_free() is called (and d_rcu is really used), d_child is known
to be empty and not touched by the dentry freeing.

Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
the previous content of d_child is not needed if said dentry was unhashed
but still accessed by a CPU because of RCU constraints)

As dentry cache easily contains millions of entries, a size reduction is
worth the extra complexity of the ugly C union.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Paul Jackson <pj@sgi.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a05964f3 07-Nov-2005 Ram Pai <linuxram@us.ibm.com>

[PATCH] shared mounts handling: umount

An unmount of a mount creates a umount event on the parent. If the
parent is a shared mount, it gets propagated to all mounts in the peer
group.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 3f4bb1f4 06-Sep-2005 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] struct dentry: place d_hash close to d_parent and d_name to speedup lookups

dentry cache uses sophisticated RCU technology (and prefetching if
available) but touches 2 cache lines per dentry during hlist lookup.

This patch moves d_hash in the same cache line than d_parent and d_name
fields so that :

1) One cache line is needed instead of two.

2) the hlist_for_each_rcu() prefetching has a chance to bring all the
needed data in advance, not only the part that includes d_hash.next.

I also changed one old comment that was wrong for 64bits.

A further optimisation would be to separate dentry in two parts, one that
is mostly read, and one writen (d_count/d_lock) to avoid false sharing on
SMP/NUMA but this would need different field placement depending on 32bits
or 64bits platform.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!